# FDR-Control in Multiscale Change-point Segmentation

## Abstract.

Fast multiple change-point segmentation methods, which additionally provide faithful statistical statements on the number, locations and sizes of the segments, have recently received great attention. In this paper, we propose a multiscale segmentation method, FDRSeg, which controls the false discovery rate (FDR) in the sense that the number of false jumps is bounded linearly by the number of true jumps. In this way, it adapts the detection power to the number of true jumps. We prove a non-asymptotic upper bound for its FDR in a Gaussian setting, which allows us to calibrate the only parameter of FDRSeg properly. Change-point locations, as well as the signal, are shown to be estimated in a uniform sense at optimal minimax convergence rates up to a log-factor. The latter is w.r.t. $L^p$-risk, $p \ge 1$, over classes of step functions with bounded jump sizes and either bounded, or possibly increasing, number of change-points. FDRSeg can be efficiently computed by an accelerated dynamic program; its computational complexity is shown to be linear in the number of observations when there are many change-points. The performance of the proposed method is examined by comparisons with some state-of-the-art methods on both simulated and real datasets. An R-package is available online.

Keywords: Multiscale inference; change-point regression; false discovery rate; deviation bound; dynamic programming; minimax lower bound; honest inference; array CGH data; ion channel recordings.

## 1. Introduction

To keep the presentation simple, we assume that observations are given by the regression model

$$Y_i = \mu\left(\tfrac{i}{n}\right) + \sigma\varepsilon_i, \qquad i = 0, 1, \ldots, n-1, \tag{1}$$

where the $\varepsilon_i$ are independent standard normally distributed, and $\sigma > 0$. The mean-value function $\mu$ is assumed to be right-continuous and piecewise constant with $K+1$ segments, i.e.

$$\mu = \sum_{k=0}^{K} c_k \mathbf{1}_{[\tau_k, \tau_{k+1})}. \tag{2}$$

Here the number $K$ of change-points is unknown, as well as the change-points $\tau_k$, $k = 1, \ldots, K$, with the convention that $\tau_0 = 0$ and $\tau_{K+1} = 1$. The (unknown) value of $\mu$ on the $k$-th segment is denoted by $c_k$ and we assume $c_{k-1} \neq c_k$ for identifiability of $\mu$. We stress, however, that much of our subsequent methodology and analysis can be extended to other models, e.g. for nonequidistant sampling points, when the observations come from an exponential family or, more generally, when errors obey certain moment conditions, and to dependent data. The latter case will be illustrated in Section 5.3 for the segmentation of ion channel recordings.
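As a concrete illustration of model (1) with a step function as in (2), the following minimal Python sketch simulates observations (function and parameter names are our own, not taken from the R-package):

```python
import numpy as np

def sample_step_signal(n, taus, cs, sigma, rng):
    """Draw Y_i = mu(i/n) + sigma * eps_i for a right-continuous step function mu.

    taus: interior change-points in (0, 1); cs: segment levels (len(taus) + 1 of them).
    """
    t = np.arange(n) / n
    # searchsorted with side="right" returns the index of the segment containing i/n,
    # which makes mu right-continuous as in (2)
    mu = np.asarray(cs)[np.searchsorted(taus, t, side="right")]
    return mu + sigma * rng.standard_normal(n)

rng = np.random.default_rng(0)
y = sample_step_signal(n=100, taus=[0.3, 0.7], cs=[0.0, 2.0, -1.0], sigma=0.5, rng=rng)
```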

Estimation of $\mu$ and its change-points in this seemingly simple model (1) (and variations thereof) has a long history in statistical research (see e.g. (Carlstein et al., 1994; Csörgö and Horváth, 1997; Siegmund, 2013; Frick et al., 2014) for a survey). It has recently gained renewed interest from two perspectives, in particular. Firstly, large scale applications such as in finance (see e.g. (Inclán and Tiao, 1994; Bai and Perron, 1998; Lavielle and Teyssière, 2007; Spokoiny, 2009; Davies et al., 2012)), signal processing (see e.g. (Harchaoui and Lévy-Leduc, 2008; Blythe et al., 2012; Hotz et al., 2013)) or genetic engineering (see e.g. (Braun et al., 2000; Olshen et al., 2004; Zhang and Siegmund, 2007, 2012; Jeng et al., 2010; Siegmund, 2013)) call for change-point segmentation methods which are computationally fast, say almost linear in the number of observations. Secondly, besides a mere segmentation of the data into pieces of constancy, certain evidence on the number, locations and heights of these pieces is demanded along with the segmentation.

Many state-of-the-art segmentation methods which aim to meet the latter two goals are based on minimizing a penalized cost functional over different numbers $K$ of change-points and change-point locations $\tau_1, \ldots, \tau_K$. For a cost function $C$, which serves as a goodness-of-fit measure of a constant function on an interval, and a penalty $\gamma_n f(K)$ against over-fitting, these approaches search for a solution of the global optimization problem

$$\min_{\mu} \sum_{k=0}^{K} C\left(Y_{\lceil n\tau_k\rceil}, \ldots, Y_{\lceil n\tau_{k+1}\rceil - 1};\, c_k\right) + \gamma_n f(K). \tag{3}$$

Fast and exact algorithms for this kind of method employ dynamic programming, such as the optimal partitioning method (Jackson et al., 2005) and the Potts estimate (Boysen et al., 2009; Storath et al., 2014), whose authors advocate the sparsest subset selection penalty

$$f(K) = \ell_0(\mu) = K. \tag{4}$$

For more general $f$, see e.g. the segment neighbor method (Auger and Lawrence, 1989) or (Friedrich et al., 2008). More recently, Killick et al. (Killick et al., 2012) introduced a pruned dynamic program (PELT) with expected linear complexity, mainly for the penalty in (4), and Du et al. (Du et al., 2015) used dynamic programming to compute the marginal MLE in a Bayesian framework. From a computational point of view, approaches of type (3) therefore seem beneficial. Nevertheless, the choice of $f$ and its associated balancing parameter $\gamma_n$ in (3) is subtle. Birgé and Massart (Birgé and Massart, 2006) offer examples and discussion of this and other penalty choices, and Boysen et al. (Boysen et al., 2009) provide asymptotically optimal choices of $\gamma_n$, as $n \to \infty$. Zhang and Siegmund (Zhang and Siegmund, 2007, 2012) proposed a penalty depending on $K$ and additionally on the distances between consecutive change-points.

In contrast to solving the global optimization problem in (3), another prominent class of methods is based on the idea of iteratively applying a local segmentation method to detect a single change-point. If such a change-point is detected on a segment, it is split into two parts and the same routine is applied to both new segments. The method stops if no further change-points are found. This approach, referred to as binary segmentation (BS), is certainly among the most popular ones for change-point segmentation, in particular in the context of the analysis of copy number variation data and related biostatistical issues. It was already suggested in (Scott and Knott, 1974) and more recently related methods have been proposed, such as circular binary segmentation (CBS) (Olshen et al., 2004; Venkatraman and Olshen, 2007) and wild binary segmentation (WBS) (Fryzlewicz, 2014). For these approaches, the parameter to be specified is, among others, the probability of including a false change-point in one iteration. Therefore, local error control can be provided, but the overall uniform control on the error of including or excluding wrong segments often appears to be difficult for these methods as well. A notable exception is (Fryzlewicz, 2014, Theorems 3.2 and 3.3); however, these bounds depend on constants which are difficult to specify.

However, given the data at hand, drawing significant conclusions on the number, locations and sizes of the segments of the change-point function is not an easy task for the above-mentioned methods, as this requires uniform finite sample error bounds for all these quantities simultaneously. A similar comment applies to other global segmentation methods which rely on an approximation of the nonconvex penalty in (4), including lasso-type techniques, possibly together with post filtering to further enhance sparseness, see e.g. (Tibshirani et al., 2005; Friedman et al., 2007; Harchaoui and Lévy-Leduc, 2010).

Frick et al. (Frick et al., 2014) suggest a hybrid method, the simultaneous multiscale change-point estimator (SMUCE), which tries to address both tasks (computationally fast while still obeying finite sample uniform error control) by minimizing the number of change-points under a local multiscale side-constraint, see also (Boysen et al., 2009; Davies et al., 2012) for related estimators. The side-constraint is based on a simultaneous multiple testing procedure on all scales (lengths of subsequent observations) which employs a scale calibrating penalty (Dümbgen and Spokoiny, 2001). It can be shown that, with a pre-defined probability, the resulting segmentation does not overestimate the number of change-points (i.e. the family-wise error rate, FWER, is controlled). This provides a direct statistical interpretation. In fact, the error of including false positives provided by SMUCE has exponential decay,

$$P\{\hat{K} \ge K + j\} \le \alpha_S^{\lceil j/2\rceil}, \qquad j = 1, 2, \ldots \tag{5}$$

(see (Frick et al., 2014)), which in particular controls the overestimation of the number of true change-points ($j = 1$ in (5))

$$P\{\hat{K} > K\} \le \alpha_S. \tag{6}$$

Moreover, it can be shown that the method is able to detect the true change-points over a large range of scales with minimax detection power (Frick et al., 2014, Theorem 5). However, according to (6), in particular in situations with low signal-to-noise ratio (SNR) or with many change-points compared to the number of observations, this error control necessarily leads to a conservative estimate of $\mu$ in (2), i.e. one with fewer change-points than the true number $K$. Therefore, in this paper we offer a strategy to overcome this drawback, which might be beneficial also for other related methods. It is based on control of the false discovery rate (FDR) (Benjamini and Hochberg, 1995) instead of the FWER control in (6). Despite the huge literature on change-point segmentation and detection, there is only a small number of papers addressing the FDR issue in this context. Early references include (Tibshirani and Wang, 2008), which proposed a multiple stage procedure and gave empirical evidence for FDR control, and (Efron and Zhang, 2011), which considered a local FDR based approach for the copy number variation analysis of multiple samples in cancer genetics. Recently, Hao et al. (Hao et al., 2013) proved the FDR control of the screening and ranking algorithm (Niu and Zhang, 2012) for a restricted definition of FDR, and Cheng and Schwartzman (Cheng and Schwartzman, 2015) provided an asymptotic control of the FDR of a smoothing based approach. For further discussion see Section 1.2.

### 1.1. FDRSeg

In this work, we will present FDRSeg, which controls the FDR of the whole segmentation. The significance statement given by the method is quite intuitive and also holds for a finite number of observations. The contribution of this work is thus threefold: First, the new method overcomes the conservative nature of SMUCE and variants (see (Chen et al., 2014)) while maintaining a solid statistical interpretation. In doing so, we provide a general framework for combining FDR-control with global segmentation methods in a multiscale fashion, which is of interest in its own right. Second, various optimality statements are provided, and all results hold in a non-asymptotic manner uniformly over a large class of piecewise constant functions in model (2). Third, FDRSeg is shown to be computable, often in almost linear time. In summary, FDRSeg is a hybrid segmentation technique, combining statistical efficiency and fast computation while providing solutions with preassigned statistical accuracy.

Before going into details, we illustrate our approach by the example in Figure 1. We employed the blocks signal (Donoho and Johnstone, 1994) with Gaussian observations of a given standard deviation. Very naturally, we declare such discoveries (estimated change-points) true if they are "close" (to be specified later) to true change-points. In this example FDRSeg detects all the change-points correctly, while SMUCE finds only a subset of them, due to its requirement to control the FWER in (6). This remains valid until $\alpha$ is increased beyond a certain value; for larger $\alpha$, FDRSeg overestimates the number of change-points. For example, it then finds one additional false change-point (marked by a vertical red line, with an associated interval defined in (7), in the bottom panel) besides all the true ones. The proportion of false discoveries among all discoveries plus one (the number of segments) is the quantity of interest here. Later we will show that FDRSeg is indeed able to control this proportion in expectation at any predefined level, uniformly over all possible change-point functions $\mu$. In the other direction, only for very small $\alpha$ does FDRSeg underestimate the number of change-points (see the third panel; the missing change-point is marked by a vertical green line). That is, FDRSeg estimates the correct number of change-points over a wide range of $\alpha$ and hence appears to be remarkably stable in terms of the control parameter $\alpha$. This will be investigated in more detail later.

### 1.2. Multiplicity and FDR control

For our purpose it is helpful to interpret the "detection part" of the multiple change-point regression problem as a multiple testing problem. In the literature, methods with this flavor often consider multiscale local likelihood tests. Whereas local tests for the presence of a change-point on small systems of sets (e.g. the dyadic ones) of the sampling points can be efficiently computed, they may have low detection power, and highly redundant systems, such as the system of all intervals, have been suggested instead (Siegmund and Yakir, 2000; Dümbgen and Spokoiny, 2001; Frick et al., 2014). See, however, (Walther, 2010; Rivera and Walther, 2013) for less redundant but still asymptotically efficient systems. It was pointed out in (Siegmund et al., 2011) that classical FDR for redundant systems might be misleading, because such local tests are highly correlated, and consequently tests on nearby intervals are likely to reject/accept the null-hypothesis together, see also (Benjamini and Yekutieli, 2001; Guo and Sarkar, 2013) for a general discussion of this issue. Siegmund et al. (Siegmund et al., 2011) therefore suggest testing for constancy on subintervals, grouping the nearby false (or true) rejections, and counting them as a single discovery, which allows one to control the FDR group-wise. In our approach, we circumvent this, but are still able to work with redundant systems, because instead we perform a multiple test for the change-points directly, i.e. we treat the multiple testing problem

$$H_i: \tfrac{i}{n} \text{ is not a change-point} \quad \text{vs.} \quad A_i: \tfrac{i}{n} \text{ is a change-point}, \qquad i = 0, \ldots, n-1.$$

It remains to define a true/false discovery. This is done by identifying a rejection as a true discovery if it is "close" to a true change-point. To be specific, let $\hat\tau_1 < \cdots < \hat\tau_{\hat K}$ be the rejections (i.e. estimated change-points), and $\hat K$ the estimated number of change-points. For each $i$, we classify $\hat\tau_i$ as a true discovery if there is a true change-point lying in

$$\left[\frac{\lceil n(\hat\tau_{i-1} + \hat\tau_i)/2\rceil}{n},\; \frac{\lceil n(\hat\tau_i + \hat\tau_{i+1})/2\rceil}{n}\right) \tag{7}$$

where $\hat\tau_0 = 0$ and $\hat\tau_{\hat K + 1} = 1$; otherwise, it is a false discovery, see again the bottom panel in Figure 1. Similar to (Benjamini and Hochberg, 1995), we then define the false discovery rate (FDR) by

$$\mathrm{FDR} := E\left[\frac{\mathrm{FD}}{\hat K + 1}\right], \tag{8}$$

where FD is the number of false discoveries in the above sense. Note that the above notion of true/false discoveries is well defined: (a) every estimated change-point is either true or false, but not both; (b) corresponding to each true change-point there is at most one true discovery, because the intervals (7) are disjoint for different $i$. We stress that no additional assumption, such as sparsity of change-points or a minimal length of segments, is needed for this definition. It automatically adapts to the individual lengths of segments, in particular in regions of rapid changes, such as the subgating characteristic of ion channel recordings (Hotz et al., 2013). To some extent it neglects the accuracy of jump locations, especially when the change-points are located far apart. In this sense, this definition primarily focuses on the correct number of change-points rather than the locations or sizes of the segments. In the following we will see, however, that our method also has high accuracy in estimating the locations. To this end we will consider the following evaluation measure

$$d(\mu, \hat\mu) := \max_{0 \le i \le K+1}\; \min_{0 \le j \le \hat K + 1} \left|\tau_i - \hat\tau_j\right|, \tag{9}$$

for $\mu$ and $\hat\mu$, with the convention that $\tau_0 = \hat\tau_0 = 0$ and $\tau_{K+1} = \hat\tau_{\hat K + 1} = 1$. In addition, we will examine the $L^p$-risk ($1 \le p < \infty$) of $\hat\mu$.
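The discovery classification (7) and the proportion inside the expectation in (8) can be sketched as follows (a hypothetical helper, not part of the R-package; change-points are given as fractions in $[0,1)$):

```python
import numpy as np

def false_discoveries(true_taus, est_taus, n):
    """Count estimated change-points with no true change-point in the
    window (7) around them; grid points are multiples of 1/n."""
    est = [0.0] + sorted(est_taus) + [1.0]   # conventions: tau_0 = 0, tau_{K+1} = 1
    fd = 0
    for i in range(1, len(est) - 1):
        lo = np.ceil(n * (est[i - 1] + est[i]) / 2) / n
        hi = np.ceil(n * (est[i] + est[i + 1]) / 2) / n
        if not any(lo <= t < hi for t in true_taus):
            fd += 1                          # false discovery
    return fd

# the proportion FD / (K_hat + 1), whose expectation is the FDR in (8)
fd = false_discoveries([0.3, 0.7], [0.31, 0.5], n=100)
prop = fd / (2 + 1)
```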

### 1.3. Plan of the paper

The rest of the paper is organized as follows. In Section 2, we introduce the new segmentation method FDRSeg and show its FDR control. In Section 3 we prove a finite sample exponential deviation bound for the estimation error of the jump locations in (9) (see Theorem 3.1). From this we derive that the locations are estimated at the optimal sampling rate up to a log-factor, uniformly over a large class of sequences of step functions with possibly increasing number of change-points, minimal scale of order $\log n / n$, and non-vanishing minimal jump height. Further, for the estimate $\hat\mu$ we show that its $L^p$-risk ($1 \le p < \infty$) is of order $(\log n / n)^{\min\{1/2,\,1/p\}}$ (see Theorem 3.3) on the class of step functions with minimal scale and jump bounded away from zero and bounded jump size. In Theorem 3.4 we prove a lower bound for the $L^p$-risk which reveals FDRSeg to be minimax optimal up to a log-factor in this class.

In Section 4 we will develop a pruned dynamic program for the computation of FDRSeg. It has linear memory complexity, and linear time complexity for signals with many change-points, in terms of the number of observations. The accuracy and efficiency of FDRSeg are examined in Section 5 on both simulated and real datasets. Compared to state-of-the-art methods, FDRSeg shows high power in detecting change-points and high efficiency for signal recovery on various scales, simultaneously. As demonstrated on ion channel recordings, a modification for dependent data (D-FDRSeg) reveals relevant gating characteristics, but at the same time avoids spurious change-points which are misleadingly found without adaptation to the correlated noise. The paper ends with a conclusion in Section 6.

An implementation of FDRSeg is provided in R-package “FDRSeg”, available from http://www.stochastik.math.uni-goettingen.de/fdrs.

## 2. Method and FDR control

Now we will give a formal definition of FDRSeg. To simplify, we assume that the noise level $\sigma$ is known. For methods to estimate $\sigma$, see (23) or e.g. (Rice, 1984; Hall et al., 1990; Dette et al., 1998) among many others. Assume that $Y = (Y_0, \ldots, Y_{n-1})$ is given by model (1). For an interval $I \subseteq [0, 1)$ we consider the multiscale statistic with scale calibration (motivated from (Frick et al., 2014))

$$T_I(Y, c) = \max_{[i/n,\, j/n] \subseteq I} \frac{\left|\sum_{l=i}^{j}(Y_l - c)\right|}{\sigma\sqrt{j - i + 1}} - \mathrm{pen}\left(\frac{j - i + 1}{\#I}\right), \tag{10}$$

where $c$ is a real number, $\mathrm{pen}(r) := \sqrt{2\log(e/r)}$ the penalty term for the scale $r$, and $\#I$ the number of sampling points in $I$ (the scale). The first term in (10) describes how well the data can locally be described by the constant $c$ on the interval $[i/n, j/n]$, and the second term (the so-called scale calibration) is designed to balance the detection power among different scales (i.e. lengths of intervals), see (Dümbgen and Spokoiny, 2001; Frick et al., 2014) for further details. Thus, $T_I(Y, c)$ examines the hypothesis that $\mu \equiv c$ on the interval $I$ simultaneously over all subintervals of $I$, i.e. in particular on all scales within $I$.
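A brute-force evaluation of the statistic (10) may be sketched as follows; here the scale calibration $\mathrm{pen}(r) = \sqrt{2\log(e/r)}$ is assumed from the surrounding discussion, and the function names are ours:

```python
import numpy as np

def pen(r):
    # assumed scale calibration pen(r) = sqrt(2 log(e/r)), r in (0, 1]
    return np.sqrt(2 * np.log(np.e / r))

def multiscale_stat(y, c, sigma):
    """Brute-force T_I(y, c) of (10), with I the full index range of y."""
    m = len(y)
    s = np.concatenate([[0.0], np.cumsum(y - c)])   # partial sums of y_l - c
    best = -np.inf
    for i in range(m):
        for j in range(i, m):
            length = j - i + 1
            val = abs(s[j + 1] - s[i]) / (sigma * np.sqrt(length)) - pen(length / m)
            best = max(best, val)
    return best
```

A candidate constant $c$ is then accepted on $I$ iff `multiscale_stat(y, c, sigma)` does not exceed the quantile $q_\alpha(\#I)$ of (11). The double loop is $O(m^2)$; the cumulative-sum trick avoids recomputing each interval sum from scratch.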

For $\alpha \in (0, 1)$, let us introduce local quantiles $q_\alpha(m)$, $m = 1, \ldots, n$, by

$$q_\alpha(m) := \min\left\{q : P\{T_I(\varepsilon, \bar\varepsilon_I) > q\} \le \alpha\right\}, \tag{11}$$

where $\varepsilon = (\varepsilon_0, \ldots, \varepsilon_{n-1})$ is standard normally distributed, $\bar\varepsilon_I$ is the mean of $\varepsilon$ over $I$, and $I$ is a fixed interval with $\#I = m$. Obviously, $q_\alpha(m)$ does not depend on the particular choice of $I$ as long as $\#I = m$, which justifies the definition (11).

###### Remark 2.1.

As a direct consequence of (Dümbgen and Spokoiny, 2001) (see also (Dümbgen and Walther, 2008; Frick et al., 2014)), the limit distribution of the multiscale statistic is finite almost surely and continuous (Dümbgen et al., 2006), as $n \to \infty$. For every $\alpha \in (0, 1)$, the values $q_\alpha(m)$ are therefore uniformly bounded for all $m$ and $n$. In practice, the $q_\alpha(m)$'s are obtained by Monte-Carlo simulations. Note that this needs to be done only once and can be stored in a table, as it depends neither on the data nor on the signal $\mu$.
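Such a Monte-Carlo approximation of the local quantiles (11) might look as follows (a sketch under the same assumed penalty $\mathrm{pen}(r) = \sqrt{2\log(e/r)}$; in practice far more simulations than the default here would be used):

```python
import numpy as np

def t_stat(eps):
    """T_I(eps, mean(eps)) of (10) by brute force, sigma = 1."""
    m = len(eps)
    x = eps - eps.mean()
    s = np.concatenate([[0.0], np.cumsum(x)])
    vals = [abs(s[j + 1] - s[i]) / np.sqrt(j - i + 1)
            - np.sqrt(2 * np.log(np.e * m / (j - i + 1)))   # pen((j-i+1)/m)
            for i in range(m) for j in range(i, m)]
    return max(vals)

def local_quantile(m, alpha, n_sim=2000, seed=0):
    """Monte-Carlo estimate of q_alpha(m) in (11)."""
    rng = np.random.default_rng(seed)
    sims = np.sort([t_stat(rng.standard_normal(m)) for _ in range(n_sim)])
    return sims[int(np.ceil((1 - alpha) * n_sim)) - 1]   # empirical (1-alpha)-quantile
```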

For our purpose we have to introduce the set of step functions restricted to the multiscale side-constraint induced by (10) and (11) (for fixed $\alpha$)

$$\mathcal{C}_k = \left\{\mu = \sum_{i=0}^{k} c_i\mathbf{1}_{I_i} : T_{I_i}(Y, c_i) - q_\alpha(\#I_i) \le 0 \;\; \forall\, i = 0, 1, \ldots, k\right\}. \tag{12}$$

The estimated number of change-points according to FDRSeg will then be given by

$$\hat{K} := \min\{k : \mathcal{C}_k \neq \emptyset\}. \tag{13}$$

The $\hat K$ will always be an integer between $0$ and $n-1$, since $\mathcal{C}_{n-1} \neq \emptyset$. The FDRSeg estimate is given by

$$\hat\mu := \operatorname*{argmin}_{\mu \in \mathcal{C}_{\hat K}} \sum_{i=0}^{n-1}\left(Y_i - \mu\left(\tfrac{i}{n}\right)\right)^2, \tag{14}$$

that is, the constrained maximum likelihood estimator within $\mathcal{C}_{\hat K}$. The intuition behind FDRSeg is to search for the simplest step function (with complexity measured by the number of change-points) which satisfies the multiscale constraint in the form of (12).

The main result of this section is that our estimator is able to control the FDR in the sense of (8) by choosing the local quantiles $q_\alpha(m)$ for intervals of length $m$ in (11) properly.

###### Theorem 2.2.

Let $Y_0, \ldots, Y_{n-1}$ be observations from model (1), and $\alpha \in (0, 1)$. Then FDRSeg in (12)–(14), with local quantiles as in (11), controls the FDR defined in (8),

$$\mathrm{FDR}_{\hat\mu(\alpha)} \le \frac{2\alpha}{1-\alpha} =: \beta. \tag{15}$$
###### Proof.

See Appendix A.1. ∎

###### Remark 2.3 (Discussion of the bound).

Various simulation studies (not displayed) suggest even the bound $\alpha/(1-\alpha)$, improving (15) by a factor of 2. Although we were not able to prove this, we stress that it might be useful for practical purposes to select and interpret $\alpha$. For example, in Figure 2 we display results for the teeth signal (see Figure 6), where the FDR is estimated by the empirical mean over 1,000 repetitions. It shows that the bound (15) (dashed line) is good when $\alpha$ is small, and gets worse as $\alpha$ increases.

###### Remark 2.4 (Choice of parameter for FDRSeg).

Note that Theorem 2.2 provides statistical guidance for the choice of the only parameter $\alpha$ of FDRSeg. To calibrate the method for a given FDR level $\beta$, we simply rewrite (15) into

$$\alpha = \frac{\beta}{2 + \beta},$$

which is roughly $\beta/2$ for small $\beta$, see Figure 3. In practice, one could even use $\alpha = \beta/(1+\beta)$, as discussed in Remark 2.3. We further stress that FDRSeg is actually robust to the choice of $\alpha$ (or $\beta$), as we have already seen in Figure 1.
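In code, the calibration of Remark 2.4 is a one-liner (illustrative only; the function name is ours):

```python
def level_from_fdr(beta):
    """Invert the bound (15): choose alpha so that 2*alpha/(1 - alpha) = beta."""
    return beta / (2 + beta)

alpha = level_from_fdr(0.1)   # roughly beta/2 for small beta
```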

###### Remark 2.5 (Comparison of SMUCE and FDRSeg).

Let us stress some notable differences to SMUCE (Frick et al., 2014), which is based on restricting possible estimators to

$$\mathcal{C}^0_k = \left\{\mu = \sum_{i=0}^{k} c_i\mathbf{1}_{I_i} : \max_{i=0,\ldots,k} T^0_{I_i}(Y, c_i) \le \tilde q_{\alpha_S}\right\},$$

where $T^0_{I}$ is as in (10), but with the penalty $\mathrm{pen}_0$ instead, and

$$\mathrm{pen}_0(j - i + 1) := \sqrt{2\log\frac{en}{j - i + 1}}. \tag{16}$$

Firstly, the penalty term underlying SMUCE on the interval $[i/n, j/n]$ only relates the number of observations in this interval to all $n$ observations, while that of FDRSeg relies on the ratio between the number of observations in the interval and the length of the corresponding segment of $\mu$. This modification has a flavor similar to the refined Bayes information criterion type of penalty in (Zhang and Siegmund, 2007). Secondly, the parameter $\alpha_S$ of SMUCE ensures that the true signal lies in the side-constraint with probability at least $1 - \alpha_S$. In contrast, FDRSeg considers constant parts of the true signal individually, guaranteeing that the mean value of each segment lies in its associated side-constraint with probability at least $1 - \alpha$. This makes it much less conservative, and its error controllable in terms of FDR (see Theorem 2.2). This is a key idea underlying FDRSeg. For an illustration of this effect see Figure 4. Thirdly, the thresholding underlying SMUCE is based on a global quantile. In contrast, for FDRSeg, the quantiles in (11) are chosen locally according to the scale, rendering the resulting method less conservative. Note that the local quantiles $q_\alpha(m)$ in (11) and the global quantile $\tilde q_{\alpha_S}$ in (16) differ even when $\alpha = \alpha_S$. Simulations show that the local quantiles lie below the global one for every $\alpha$ and $m$, see Figure 5. This again highlights that FDRSeg detects more change-points than SMUCE.

In situations with many change-points or low SNR, to overcome the conservative nature of SMUCE, it has been suggested to choose the significance level $\alpha_S$ in (6), which controls the overestimation error, close to one, producing an estimate with good screening properties (Frick et al., 2014), although then the confidence statements in (5) and (6) become statistically meaningless. It follows from the arguments above that the parameter $\alpha$ of FDRSeg relates to $\alpha_S$ roughly by

$$1 - (1-\alpha)^{K+1} \approx \alpha_S, \tag{17}$$

because the probability of coverage of the true signal by the side-constraint of FDRSeg is $(1-\alpha)^{K+1}$, where $K$ is the true number of change-points. This is confirmed by simulations. For example, consider the recovery of a teeth signal (adopted from (Fryzlewicz, 2014)) from observations contaminated by standard Gaussian noise, see Figure 6. In Figure 7, the histograms of the estimated number of change-points by SMUCE and FDRSeg are shown in white bars from 1,000 repetitions. It can be seen that SMUCE seriously underestimates the number of change-points, while FDRSeg estimates the right number of change-points with high probability. If we adjust $\alpha_S$ according to (17), this leads to a significant improvement of the detection power of SMUCE, as shown by the corresponding histogram of estimated numbers of change-points in grey bars (left panel in Figure 7), however, at the expense of any reasonable statistical error control, i.e. the control of overestimating the true $K$ for SMUCE becomes increasingly more difficult as $K$ gets larger. On the other hand, FDRSeg adapts to $K$ automatically, and works well with small values of $\beta$ in (15). Moreover, concerning the accuracy of locations, the medians of $d(\mu, \hat\mu)$, see (9), of SMUCE and FDRSeg differ, while such medians conditioned on the correct number of change-points have the same value. This confirms the visual impression when comparing the two lower panels in Figure 6: local thresholding in (11) and (12) makes an important difference to SMUCE.
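Relation (17) can be used to translate a SMUCE level into a comparable FDRSeg level; a hypothetical helper:

```python
def alpha_from_smuce(alpha_s, k):
    """Solve (17), 1 - (1 - alpha)**(k + 1) = alpha_s, for alpha,
    given a SMUCE level alpha_s and a (true) number of change-points k."""
    return 1 - (1 - alpha_s) ** (1 / (k + 1))
```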

## 3. Risk bounds for FDRSeg

In order to state uniform results on the $L^p$-risk of $\hat\mu$ and on the simultaneous estimation of the change-point locations, we define the smallest segment length of a step function $\mu$ in (2) by

$$\lambda_\mu := \min_{0 \le k \le K} |\tau_{k+1} - \tau_k|,$$

and the smallest jump size of $\mu$ by

$$\Delta_\mu := \min_{1 \le k \le K} |c_k - c_{k-1}|, \qquad \text{if } K_\mu \ge 1.$$

The subscript $\mu$ will be suppressed in the following if there is no ambiguity. Note that no method can recover arbitrarily fine details, measured in terms of $\lambda$ and $\Delta$, for a given sample size $n$. More precisely, the detection boundary for testing a single bump of height $\Delta_n$ on an interval $I_n$ against a zero signal is asymptotically given as

$$\frac{\Delta_n}{\sigma}\sqrt{|I_n|} \ge \sqrt{\frac{2\log|I_n|^{-1}}{n}} + a_n^{-1} \qquad \text{with } a_n = o(\sqrt{n}), \tag{18}$$

see (Chan and Walther, 2013; Frick et al., 2014). It is worth noting that FDRSeg detects such signals (18) with asymptotic power 1, provided that the level $\alpha$ is bounded away from 0, as $n \to \infty$. The proof is omitted because it is similar to that of (Frick et al., 2014, Theorem 5).

In the following we will show how $\lambda$ and $\Delta$ determine the detection and estimation difficulty for step functions with multiple change-points (cf. Theorems 3.1 and 3.3) in a non-asymptotic way. The following exponential bound for the estimated locations provides a theoretical justification of the previous empirical findings (see also Section 5) of the good detection and estimation performance of FDRSeg.

###### Theorem 3.1.

Assume the change-point regression model (1) with signal $\mu$ in (2), and let $\lambda = \lambda_\mu$, $\Delta = \Delta_\mu$, and $d(\cdot, \cdot)$ be defined in (9). Then for the FDRSeg estimator $\hat\mu$ in (14), the following statements are valid:

• It holds for any $\delta \in (0, 1)$ that

$$P\{d(\mu, \hat\mu) > \delta\} \le 2K\exp\left(-\frac{1}{8}\left(\frac{\Delta\sqrt{n\delta\lambda}}{2\sigma} - \max_{m \le n} q_\alpha(m) - \sqrt{2\log\frac{e}{\delta\lambda}}\right)^2_+\right) + 2K\exp\left(-\frac{n\Delta^2\delta\lambda}{8\sigma^2}\right). \tag{19}$$
• Let $\mu = \mu_n$, with $K$, $\lambda$ and $\Delta$ possibly depending on $n$, let $\alpha = \alpha_n \ge n^{-\gamma}$ for some $\gamma > 0$, and assume

$$\delta\lambda \ge \frac{8\left((\sqrt{\gamma} + 1)\sqrt{\log n} + 2\sqrt{\log K}\right)^2}{n\,\min\{(\Delta/\sigma)^2, 1\}}\,(1 + \epsilon),$$

for some positive $\epsilon$ independent of $n$; then

$$\lim_{n\to\infty} P\{d(\mu, \hat\mu_n) > \delta\} = 0.$$

In particular, for every sufficiently large $C > 0$, we have

$$\lim_{n\to\infty} \sup_{\mu \in \mathcal{A}_n} P\left\{d(\mu, \hat\mu_n) > \frac{C\log n}{n}\right\} = 0,$$

where

$$\mathcal{A}_n := \left\{\text{step signal } \mu \text{ in (2), s.t. } \lambda_\mu \ge \frac{2C\log n}{n} \text{ and } \Delta_\mu \ge \sigma \text{ if } K_\mu \ge 1\right\}.$$
###### Proof.

See Appendix A.2. ∎

###### Remark 3.2.

It is worth noting that the first term in (19) is always greater than the second one, and that the influence of $\alpha$ (or equivalently $\beta$, see (15)) only appears in $\max_{m \le n} q_\alpha(m)$, which is bounded, see Lemma A.3. Hence, for a fixed regression function $\mu$, FDRSeg is able to estimate the jump locations correctly at the rate $\log n / n$. Note that this is the optimal sampling rate (up to a log-factor). It improves several results obtained for other methods; e.g., for a total variation penalized estimator a slower rate has been shown in (Harchaoui and Lévy-Leduc, 2010). Theorem 3.1 also applies to a sequence of signals $\mu_n$ with $K_n$, $\lambda_n$ and $\Delta_n$ depending on $n$. For example, it shows that FDRSeg detects jump locations at a near-optimal rate for signals with possibly unbounded number of change-points $K_n$ and bounded $\Delta_n$. The same rate is shown for WBS in (Fryzlewicz, 2014), however, under the additional assumption of a bounded number of change-points or an oracle choice of the threshold depending on the underlying signal.

Next we will study the convergence rate of FDRSeg in terms of the $L^p$-risk. By $\tilde\Delta$ we denote the largest jump size of a step function $\mu$, that is,

$$\tilde\Delta = \tilde\Delta_\mu := \max_{1 \le k \le K_\mu} |c_k - c_{k-1}| \qquad \text{if } K_\mu \ge 1.$$

Let us introduce the following class of step functions, $\mathcal{B}_{\nu, \epsilon, L}$, with bounded minimal segment length and jump size:

$$\mathcal{B}_{\nu, \epsilon, L} := \left\{\text{step signal } \mu : \lambda_\mu \ge \nu, \text{ and } \epsilon \le \Delta_\mu \le \tilde\Delta_\mu \le L \text{ if } K_\mu \ge 1\right\}, \tag{20}$$

for $\nu \in (0, 1)$ and $0 < \epsilon \le L < \infty$. Within such classes, we obtain a uniform control on the $L^p$-risk of FDRSeg for $1 \le p < \infty$.

###### Theorem 3.3.

Assume $\mathcal{B}_{\nu, \epsilon, L}$ is defined in (20), and let $\hat\mu_{n, \alpha_n}$ be the FDRSeg estimator with parameter $\alpha_n$ from $n$ observations in model (1).

• If $\nu$ is independent of $n$ and

$$\alpha_n = o\left(\frac{1}{\sqrt{n}}\left(\frac{\log n}{n}\right)^{\min\{1/2,\,1/p\}}\right),$$

then

$$\limsup_{n\to\infty}\; \sup_{\mu \in \mathcal{B}_{\nu, \epsilon, L}} E\left[\left\|\hat\mu_{n, \alpha_n} - \mu\right\|_{L^p}\right]\left(\frac{\nu\epsilon^2 n}{\sigma^2\log n}\right)^{\min\{1/2,\,1/p\}} \le 25\,L,$$

for any $p \ge 1$, $\nu$, $\epsilon$, and $L$.

• If $\nu_n = c\,n^{-\gamma}$ with constants $c > 0$ and $0 < \gamma < 1$, and

$$\alpha_n = o\left(\frac{1}{n^{\gamma + 1/2}}\left(\frac{\log n}{n^{1-\gamma}}\right)^{\min\{1/2,\,1/p\}}\right),$$

then

$$\limsup_{n\to\infty}\; \sup_{\mu \in \mathcal{B}_{\nu_n, \epsilon, L}} E\left[\left\|\hat\mu_{n, \alpha_n} - \mu\right\|_{L^p}\right]\left(\frac{\epsilon^2 n^{1-\gamma}}{\sigma^2\log n}\right)^{\min\{1/2,\,1/p\}} \le 34\,L,$$

for any $p \ge 1$, $c$, $\epsilon$, and $L$.

###### Proof.

See Appendix A.3. ∎

In fact, the rates above are minimax optimal, possibly up to a log-term.

###### Theorem 3.4.

Assume the change-point regression model (1), and that $\mathcal{B}_{\nu, \epsilon, L}$ is defined in (20).

• There is a positive constant $C$, such that

$$\inf_{\hat\mu_n}\; \sup_{\mu \in \mathcal{B}_{\nu, \epsilon, L}} E\left[\|\hat\mu_n - \mu\|_{L^p}\right] \ge C\left(\frac{\sigma^2}{n}\right)^{\min\{1/2,\,1/p\}},$$

for any $p \ge 1$, $\nu$, $\epsilon$, and $L$.

• If $\nu_n = c\,n^{-\gamma}$ with constants $c > 0$ and $0 < \gamma < 1$, then there is a positive constant $C$, such that

$$\inf_{\hat\mu_n}\; \sup_{\mu \in \mathcal{B}_{\nu_n, \epsilon, L}} E\left[\|\hat\mu_n - \mu\|_{L^p}\right] \ge C\left(\frac{\sigma^2}{n^{1-\gamma}}\right)^{\min\{1/2,\,1/p\}},$$

for any $p \ge 1$, $c$, $\epsilon$, and $L$.

###### Proof.

See Appendix A.4. ∎

## 4. Implementation

It will be shown that FDRSeg can be efficiently computed by a specific dynamic programming (DP) algorithm, which is significantly faster than the standard DP. For convenience let us introduce

$$\mathcal{I}\left(\left[\tfrac{i}{n}, \tfrac{j}{n}\right)\right) = \begin{cases} 1 & \text{if } T_{[i/n,\, j/n)}(Y, c) \le q_\alpha(j - i) \text{ for some constant } c,\\ 0 & \text{otherwise.} \end{cases}$$

We first consider the computation of $\hat K$, see (13). Let $\hat K[i]$ be the estimated number of change-points by FDRSeg when applied to $Y_0, \ldots, Y_{i-1}$, i.e.,

$$\hat K[i] := \min\left\{k : \max_{0 \le j \le k} T_{I_j}(Y, c_j) - q_\alpha(\#I_j) \le 0, \text{ for some } \mu = \sum_{j=0}^{k} c_j\mathbf{1}_{I_j} \text{ with } \biguplus_{j=0}^{k} I_j = \left[0, \tfrac{i}{n}\right)\right\}$$

for $i = 1, \ldots, n$, where $\biguplus$ denotes disjoint union. Then the estimated number of change-points in (13) is given by $\hat K = \hat K[n]$. It can be shown that the following recursive relation

$$\hat K[i] = \min\left\{\hat K[j] + 1 : 0 \le j < i,\; \mathcal{I}\left(\left[\tfrac{j}{n}, \tfrac{i}{n}\right)\right) = 1\right\}, \qquad \hat K[0] := -1, \tag{21}$$

holds for $i = 1, \ldots, n$. Eq. (21) is often referred to as the Bellman equation (Bellman, 1957), also known as the optimal substructure property in the computer science community (Cormen et al., 2009). It justifies the use of dynamic programming (Bellman, 1957; Bellman and Dreyfus, 1962) for computing FDRSeg. In this way, the computation of $\hat K$ is decomposed into smaller subproblems of determining the $\hat K[i]$'s. Each subproblem boils down to checking the existence of constant functions which satisfy the multiscale side-constraint on $[j/n, i/n)$, i.e. $\mathcal{I}([j/n, i/n)) = 1$. The $\hat K[i]$ is computed, via the recursive relation (21), as $i$ increases from $1$ to $n$. For each $i$, this involves the search space $\{0, \ldots, i-1\}$, which grows as $i$ approaches $n$. However, some of these searches are actually not necessary and can be pruned. This can be seen by rewriting the recursive relation in terms of the number of change-points. Let $r_0 := 0$ and $A_0 = B_0 := \{0\}$. For $k = 1, 2, \ldots$ let

$$\begin{aligned} r_k &:= \max\left\{j : T^0_{[i/n,\, j/n]}(Y, c) \le \max_m q_\alpha(m) \text{ for some } i \in A_{k-1},\, c \in \mathbb{R}\right\},\\ A_k &:= \left\{i \in B_{k-1} \cap [1, r_k] : \mathcal{I}\left(\left[\tfrac{j}{n}, \tfrac{i}{n}\right)\right) = 1 \text{ for some } j \in A_{k-1}\right\},\\ B_k &:= B_{k-1} \setminus A_k. \end{aligned}$$

Then $\hat K[i] = k$ for every $i \in A_k$. The reason for introducing $r_k$ is that there is no need to consider larger intervals if the multiscale side-constraint on an interval does not allow a constant signal even with the maximal penalty and the maximal quantile. Now for each $i$ we only need to search in a subset of $\{0, \ldots, i-1\}$, determined by the sets $A_{k-1}$. The complexity for computing $\hat K$ is bounded from above by

$$O\left(\sum_{k=0}^{\hat K} (\#A_k)\left(r_{k+1} - \min A_k - \frac{\#A_k}{2}\right)^2\right) \le O\left(n \max_{0 \le k \le \hat K}\left(r_{k+1} - \min A_k - \frac{\#A_k}{2}\right)^2\right). \tag{22}$$

The value $\max_{0 \le k \le \hat K}(r_{k+1} - \min A_k - \#A_k/2)$ depends on the signal and the noise. If the signal has many change-points and the segments have similar lengths, it is probably a constant independent of $n$. The higher the noise level, the larger it might be. In such situations, the computational complexity is linear in $n$, although in the worst case it can be cubic in $n$.
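The unpruned Bellman recursion (21) can be sketched as follows. This is a simplification for illustration only: the feasibility check below tests only the segment mean as candidate constant, whereas (12) requires the existence of *some* constant $c$, and `q` is a user-supplied quantile function standing in for $q_\alpha$:

```python
import numpy as np

def feasible(y, i, j, q, sigma):
    """Simplified I([i/n, j/n)): does c = mean(y[i:j]) satisfy the constraint?
    (A sketch; the exact test searches over all constants c.)"""
    seg = y[i:j]
    c = seg.mean()
    m = len(seg)
    s = np.concatenate([[0.0], np.cumsum(seg - c)])
    for a in range(m):
        for b in range(a, m):
            stat = abs(s[b + 1] - s[a]) / (sigma * np.sqrt(b - a + 1)) \
                   - np.sqrt(2 * np.log(np.e * m / (b - a + 1)))   # pen((b-a+1)/m)
            if stat > q(m):
                return False
    return True

def estimate_k(y, q, sigma):
    """Unpruned recursion (21): K[i] = min over feasible last segments [j/n, i/n)."""
    n = len(y)
    k = [0] * (n + 1)
    k[0] = -1                      # convention: empty prefix has -1 change-points
    for i in range(1, n + 1):
        k[i] = min(k[j] + 1 for j in range(i) if feasible(y, j, i, q, sigma))
    return k[n]
```

Single-point segments are always feasible here, so the minimum in the recursion is never over an empty set; the pruning via $r_k$, $A_k$, $B_k$ described above restricts the inner loop over `j`.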

Indeed, the search for $\hat K$ and the computation of the maximum likelihood estimate (14) can be done simultaneously, if we record the likelihood for each point $i$. The complexity is again bounded above by (22), but with a possibly larger constant. The memory complexity of the whole algorithm is linear, i.e. $O(n)$. We omit technical details, and provide the implementation in the R package "FDRSeg" (http://www.stochastik.math.uni-goettingen.de/fdrs).

## 5. Simulations and Applications

### 5.1. Simulation study

We now investigate the performance of FDRSeg in situations with various SNRs and different numbers of change-points, and compare it with PELT (Killick et al., 2012), BS (Scott and Knott, 1974), CBS (Olshen et al., 2004; Venkatraman and Olshen, 2007), WBS (Fryzlewicz, 2014), and SMUCE (Frick et al., 2014). As mentioned in Section 1, these methods represent a selection of powerful state-of-the-art procedures from two different viewpoints: first, exact and fast global optimization methods based on dynamic programming, including PELT and SMUCE; second, fast greedy methods based on local single change-point detection, including BS, CBS and WBS. In addition, we also include two recent fully automatic penalization methods, specifically tailored to jump detection. The first is based on a modified Schwarz information criterion (SIC) (Zhang and Siegmund, 2007), referred to as mSIC, which assumes the number of change-points is bounded. The second is a recent variant (Zhang and Siegmund, 2012), referred to as mSIC2, which is primarily designed for many change-points. Concerning implementation, we use the CRAN R-packages "PSCBS" for CBS, "wbs" for BS and WBS, "changepoint" for PELT, and an efficient implementation in our R-package "FDRSeg" for SMUCE, see http://www.stochastik.math.uni-goettingen.de/fdrs. For both SMUCE and FDRSeg, we estimate the quantile thresholds by 5,000 Monte-Carlo simulations. For PELT, we choose the penalty dubbed "SIC1" in the code provided by its authors, which works much better than the default choice. If we identify a change-point with two parameters (location and jump-size), this is the same as the SIC. For WBS, we use the automatic rule (strengthened SIC) recommended by the author. The default parameter setting provided in the packages was used for BS and CBS.
For mSIC and mSIC2, maximum likelihood estimates are first computed by dynamic programming (see Friedrich et al., 2008) for each fixed number of change-points up to some prechosen constant , and then the optimal solutions are found within these maximum likelihood estimates, according to the criteria in (Zhang and Siegmund, 2007) and (Zhang and Siegmund, 2012), respectively. Thus, their computational complexity depends increasingly on . In all simulated scenarios, we assume that the noise level is known beforehand. For quantitative evaluation, we will use the mean integrated squared error (MISE), the error of estimated locations , see (9), the FDR defined in (8) and the V-measure (Rosenberg and Hirschberg, 2007), a segmentation evaluation measure taking values in $[0,1]$, with a larger value indicating higher accuracy. It is based on two criteria for clustering usefulness, homogeneity and completeness, which capture a clustering solution’s success in including all and only the data points from a given class in a given cluster. In particular, a V-measure of $1$ indicates a perfect segmentation. All experiments are repeated 1,000 times.
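The V-measure of Rosenberg and Hirschberg (2007) can be computed from per-observation segment labels via conditional entropies, as the harmonic mean of homogeneity and completeness. A minimal sketch (function name illustrative; for segmentation, label each observation by the index of the segment containing it):

```python
import math
from collections import Counter

def v_measure(true_labels, pred_labels):
    """V-measure: harmonic mean of homogeneity 1 - H(C|K)/H(C) and
    completeness 1 - H(K|C)/H(K), computed from class (C) and cluster (K)
    label sequences of equal length."""
    n = len(true_labels)
    ct = Counter(true_labels)                 # class sizes
    ck = Counter(pred_labels)                 # cluster sizes
    joint = Counter(zip(true_labels, pred_labels))
    h_c = -sum((m / n) * math.log(m / n) for m in ct.values())
    h_k = -sum((m / n) * math.log(m / n) for m in ck.values())
    # conditional entropies H(C|K) and H(K|C) from the joint counts
    h_ck = -sum((m / n) * math.log(m / ck[k]) for (t, k), m in joint.items())
    h_kc = -sum((m / n) * math.log(m / ct[t]) for (t, k), m in joint.items())
    hom = 1.0 if h_c == 0 else 1.0 - h_ck / h_c
    com = 1.0 if h_k == 0 else 1.0 - h_kc / h_k
    return 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
```

An equivalent implementation is available as `v_measure_score` in scikit-learn; note that the measure is invariant under relabeling of the segments.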

#### Varying noise level

Let us consider the impact of different noise levels. To this end, we use the mix signal (adopted from Fryzlewicz, 2014), a mix of prominent change-points between short intervals and less prominent change-points between longer intervals, with additive Gaussian noise; see Figure 9. The noise level varies from to , and the number of observations is . For SMUCE and FDRSeg, we choose the same parameter . As shown in Figure 8, FDRSeg outperforms the others at all noise levels in terms of V-measure, MISE, , and detection power measured by the average number of detected change-points. As indicated by the number of detected change-points and MISE, PELT ranks second, followed by WBS, then mSIC, CBS, SMUCE and lastly BS. The same order of performance is also seen from the V-measure and up to , but SMUCE deteriorates more slowly as the noise level increases, achieving a better V-measure and than CBS when and than WBS at . The mSIC2 performs comparably to mSIC when , while deteriorating faster as increases, similarly to BS when . It is worth noting that the empirical FDR of FDRSeg is around , below and the theoretical bound in Theorem 2.2 (indicated by the dashed horizontal line in the lower-left panel). CBS has the second-largest empirical FDR, while that of PELT, SMUCE, mSIC, mSIC2, BS and WBS is almost zero. Once the quantiles for SMUCE and FDRSeg are simulated, they can be stored for later computations, and are therefore excluded from the recorded computation time. The computation time of FDRSeg is similar to that of the fastest methods, namely PELT, BS and SMUCE, at , and increases with the noise level . FDRSeg is faster than WBS and CBS in all scenarios. As mentioned earlier, the computation time of mSIC and mSIC2 depends on the upper bound on the possible number of change-points, which is set to 100. For a closer examination, we also show histograms of the detected change-point locations for in Figure 9.
In this situation, FDRSeg always has the largest detection power over all change-point locations.

The constant signal with no change-point serves as an example to examine whether FDRSeg detects artificial jumps. Figure 10 shows the comparison between SMUCE () and FDRSeg () when . Remarkably, the difference between the two estimators is negligible, and the overestimation of the number of jumps by FDRSeg is insignificant.

#### Varying frequency of change-points

In order to evaluate the detection power as increases, we employed the teeth signal (see Figure 6) with ,000, and , , as its integrated SNR remains the same for different numbers of change-points. The same parameter is chosen for SMUCE and FDRSeg. The results are summarized in Figure 11. FDRSeg, mSIC and PELT perform comparably well in all situations in terms of the number of detected change-points, V-measure, MISE and , while FDRSeg is slightly more accurate in the change-point locations at . As shown by the V-measure, CBS and WBS fail when , BS fails when , and SMUCE and mSIC2 deteriorate at . A similar trend can also be seen for the number of estimated change-points, MISE, and . It is interesting that the empirical FDR of FDRSeg gets closer to the theoretical bound as , indicating that the bound gets sharper for increasing , although we have no proof of this. The empirical FDR of CBS is large when the change-points are sparse, and decreases as increases, while PELT, SMUCE, mSIC, mSIC2, BS and WBS have a relatively small FDR close to zero in all cases. The computation time of FDRSeg decreases as increases, and is comparable to that of the fastest methods (SMUCE, PELT and BS) when . The computation time of mSIC and mSIC2 is the largest, since one has to search among maximum likelihood estimates with all possible numbers of change-points, i.e. .

### 5.2. Array CGH data

Identifying the locations of chromosomal aberrations in genomic DNA samples is crucial for understanding the pathogenesis of many diseases, in particular various cancers. Array comparative genomic hybridization (CGH) provides the means to quantitatively measure such changes in terms of DNA copy number (Pinkel et al., 1998). The statistical task is to accurately determine the regions of changed copy number, and the model (1) and variants thereof have been commonly studied in this context (Olshen et al., 2004; Zhang and Siegmund, 2007; Tibshirani and Wang, 2008; Jeng et al., 2010). We compared FDRSeg with SMUCE and CBS, the latter being specifically designed for the analysis of array CGH data, on the Coriel data set from Snijders et al. (2001). Following Olshen et al. (2004), outliers were removed before segmentation. The noise level is estimated by an interquartile range (IQR) rule applied to local differences (see Davies and Kovac, 2001)

 \hat{\sigma} = \frac{\hat{q}_{0.75} - \hat{q}_{0.25}}{1.349\sqrt{2}}, \qquad (23)

where is the empirical -quantile of . CBS was computed using the default parameters provided in the package “PSCBS”. The copy number variations estimated by each method are plotted together with the data (points) for cell line GM01524 in Figure 12. SMUCE () detects 8 change-points, while FDRSeg () finds 5 additional ones, all of which are found by CBS as well. The latter yields the largest number of change-points, 17, 4 of which are not supported by FDRSeg (marked by ‘x’). We stress that there is biological evidence that such small jumps might be artifacts due to genomic waves, see (Diskin et al., 2008). The model (1) apparently does not take such waves into account. Apart from this possible modeling error, it is worth noting that, by Theorem 2.2, among the 13 change-points found by FDRSeg there are on average at most 0.7 false ones. In order to study the robustness against such a modeling error, we consider step functions in (1) with a periodic trend component, as in (Olshen et al., 2004; Zhang and Siegmund, 2007), i.e.

 Y_i \sim \mathcal{N}\bigl(\mu(i/n) + 0.25\,b\sin(a\pi i),\, \sigma^2\bigr), \qquad i = 0, 1, \ldots, n-1, \qquad (24)

where , , and has change-points with values on each segment, respectively. FDRSeg with is applied to the signal (24) within a range of and . The frequency of detecting the correct number of change-points, together with the average of , over 1,000 simulations is given in Figure 13. It shows that FDRSeg is robust within a large range of local trends, and includes false positives only when the trend becomes large and highly oscillating.
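The robust noise-level estimate (23) is straightforward to implement. A sketch, under the assumption that the quantiles in (23) are taken over the lag-one differences of the observations (function name illustrative):

```python
import numpy as np

def sigma_iqr(y):
    """Robust scale estimate in the spirit of (23): the interquartile
    range of the lag-one differences, rescaled by 1.349 * sqrt(2).

    Off the (few) jumps, differences of a piecewise-constant signal plus
    N(0, sigma^2) noise are N(0, 2 sigma^2); the IQR of a standard
    normal is about 1.349, which gives the rescaling."""
    d = np.diff(np.asarray(y, dtype=float))
    q25, q75 = np.quantile(d, [0.25, 0.75])
    return (q75 - q25) / (1.349 * np.sqrt(2))
```

Because quartiles are insensitive to the few large differences at true change-points, the estimate remains nearly unbiased for signals with sparse jumps.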

### 5.3. Ion channel idealization

As prominent components of the nervous system, ion channels play major roles in cellular processes (Hille, 2001), and insight into their behavior is helpful in diagnosing many human diseases, such as epilepsy and cardiac arrhythmias (Kass et al., 2005). The goal of the data analysis is to obtain information about channel characteristics and the effect of external stimuli by monitoring channel behavior with respect to conductance and/or kinetics (Chung et al., 2007). The measuring process involves an analog low-pass filter prior to digitization. As suggested by Hotz et al. (2013), a realistic model for the observations is therefore

 Y_i = (\rho * \mu)(i\vartheta) + \tilde{\varepsilon}_i, \qquad (25)

where is the sampling rate, and the convolution kernel of the low-pass filter has compact support in an interval of length , such that . Being independent and identically distributed (i.i.d.) Gaussian noise passed through the low-pass filter , the ’s are still Gaussian with mean zero, but are now correlated.
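The effect of the low-pass filter in (25) — a smoothed step signal plus correlated noise — can be mimicked with a simple discrete convolution. The kernel and parameters below are illustrative, not those of a real amplifier:

```python
import numpy as np

def filtered_observations(mu_vals, kernel, sigma, rng):
    """Sketch of model (25): convolve a (sampled) step signal with a
    low-pass kernel, and add Gaussian noise passed through the same
    filter, which makes the noise correlated."""
    kernel = np.asarray(kernel, float)
    kernel = kernel / kernel.sum()          # normalize the kernel
    smooth_mu = np.convolve(mu_vals, kernel, mode="same")
    noise = np.convolve(sigma * rng.standard_normal(len(mu_vals)),
                        kernel, mode="same")
    return smooth_mu + noise
```

The lag-one autocorrelation of the filtered noise is strictly positive (for a length-4 moving average it is 3/4), which is exactly what the dependence-adjusted quantiles of D-FDRSeg account for.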

As mentioned earlier in Section 1, FDRSeg can be extended to more general models than (1). We illustrate this for the present case of colored noise. To this end, we modify FDRSeg to explicitly take into account the dependence structure of the noise in (25). This requires adjusting the definition of the quantiles by using dependent Gaussian random variables, see (11). Note that the dependence structure is completely known from the kernel , so the modified quantiles can again be estimated via Monte-Carlo simulations. In order to analyze the data properly, we observe that is constant on if is constant on . Thus, in the multiscale side-constraint (12) we consider only intervals contained in instead of all subintervals of , for . Incorporating these two modifications yields a version of FDRSeg adjusted to this dependency, D-FDRSeg. For comparison, we consider the jump segmentation by multiresolution filter (J-SMURF) estimator (Hotz et al., 2013). The implementation of J-SMURF is provided in the R-package “stepR”, available from CRAN. As in (Hotz et al., 2013), the significance level of J-SMURF is set to 0.05. The significance parameter of D-FDRSeg is also chosen as 0.05. The noise level, i.e. the standard deviation of , is estimated by (23) from the undersampled data .

In order to explore the potential of D-FDRSeg, we first carried out a validation study on simulated data. Mimicking various dynamics of ion channels, we chose the truth in (25) as a simulated continuous-time two-state Markov chain with various transition rates over 1 s. The true signal was tenfold oversampled at 100 kHz, and Gaussian white noise was added. Then, a digital low-pass filter with kernel in (25) was applied, and the data with 10,000 points were finally obtained after subsampling at 10 kHz. The noise level was chosen such that the SNR equals 3. All the parameters above are typical for a real experimental setup (see VanDongen, 1996; Hotz et al., 2013 for further details). The average of for J-SMURF, FDRSeg, and D-FDRSeg in 100 simulations is given in Figure 14. As expected, FDRSeg detects a large number of false positives due to the violation of the independence assumption on the noise in (25), while D-FDRSeg, with the dependence adjustment, corrects for this. It shows higher accuracy and higher detection power than J-SMURF over all transition rates. We further compare D-FDRSeg with J-SMURF on experimental data: a characteristic conductance trace of gramicidin A (provided by the Steinem lab, Institute of Organic and Biomolecular Chemistry, University of Göttingen) with a typical SNR, and ms, see Figure 15. J-SMURF detects only 8 change-points, while D-FDRSeg suggests 5 additional ones, i.e. 13 in total, all of which are visually reasonable. This illustrates the ability of D-FDRSeg to detect change-points simultaneously over various scales, as required for the investigated gramicidin channel (see (Hotz et al., 2013) for an explanation).
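The ground-truth generation used in this validation study, a continuous-time two-state Markov chain sampled on a regular grid, can be sketched as follows (function name, rates, levels, and sampling frequency are illustrative, not the exact experimental values):

```python
import numpy as np

def two_state_trace(rate01, rate10, levels, duration, fs, rng):
    """Simulate a continuous-time two-state Markov chain (a toy model of
    channel gating) and sample it at frequency fs. Holding times in
    state s are Exponential with the rate of leaving s; `levels` gives
    the two conductance values."""
    t, state = 0.0, 0
    jumps, states = [0.0], [0]
    rates = (rate01, rate10)
    while t < duration:
        t += rng.exponential(1.0 / rates[state])  # dwell time
        state = 1 - state                          # switch state
        jumps.append(t)
        states.append(state)
    grid = np.arange(int(duration * fs)) / fs
    # index of the last jump at or before each grid time
    idx = np.searchsorted(jumps, grid, side="right") - 1
    return np.asarray(levels)[np.asarray(states)[idx]]
```

Convolving such a trace with a low-pass kernel and adding filtered noise, as in (25), then yields synthetic recordings with a controlled SNR.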

## 6. Conclusion and discussion

In this work we proposed a multiple change-point segmentation method, FDRSeg, which is based on relaxing the FWER to the FDR. In experiments on both simulated and real data, FDRSeg shows high detection power with controlled accuracy. A theoretical bound on its FDR is provided, which gives a meaningful interpretation of the only user-specified parameter . In addition, we have shown that jump locations are detected at the optimal sampling rate up to a log-factor. Concerning the signal, i.e. both jump locations and function values, the convergence rate of the estimator is minimax optimal w.r.t. -risk () up to a log-factor. This result holds over classes of step signals with bounded jump sizes, and either a bounded, or possibly increasing, number of jumps.

Our method is not confined to i.i.d. Gaussian observations, although we restricted our presentation to this setting in order to highlight the main ideas more concisely. Obviously, it can be extended to more general additive errors, because the proof of Lemma A.1 relies on Gaussianity only for the independence of the residuals and the mean. For different models, e.g. exponential family regression, we believe that one can argue along similar lines as in the proof of Theorem 2.2, but the results will only hold asymptotically. This, however, is beyond the scope of this paper, and is postponed to future research. Also, as we have applied the CBS outlier smoothing procedure to the array CGH data, it might be of interest to have more robust versions of FDRSeg. To this end, a local median instead of a local mean might provide useful results. Alternatively, one may transform this into a Bernoulli regression problem (see Dümbgen and Kovac, 2009; Frick et al., 2014), which might be interesting for further research. We also suggested a modification of FDRSeg for dependent data, which shows attractive empirical results; it would be of interest to study this modified estimator from a theoretical point of view as well.

Acknowledgement. The authors thank Florian Pein, and Inder Tecuapetla for helpful discussions, and the Steinem lab (Institute of Organic and Biomolecular Chemistry, University of Göttingen) for providing the ion channel data.

## Appendix A Technical proofs

### a.1. Proof of Theorem 2.2

The proof of Theorem 2.2 relies on two lemmata. As a convention, all results concern the FDRSeg estimator in (14) without explicit statement. The first lemma gives a bound on the expected number of false discoveries (FD) given no true discoveries ($\mathrm{TD}(\alpha) = 0$), see Section 1 for the definitions.

###### Lemma A.1.

Under the above notations, we have for $0 < \alpha < 1/3$

 \mathbb{E}\left[\mathrm{FD}(\alpha) \mid \mathrm{TD}(\alpha) = 0\right] \le \frac{2\alpha}{1 - 3\alpha} =: G(\alpha).
###### Proof.

Note that it suffices to prove the result for a constant signal, which we assume w.l.o.g. to be constant zero. The proof is then based on the following observation: assume there exists an estimate with segments which fulfills the multiscale side-constraint in (12); then the FD of FDRSeg is bounded by , since FDRSeg minimizes the number of change-points among all nonempty ’s. We will prove the result by constructing such an estimate and showing that . The estimate is given by an iterative rule that includes change-points until the multiscale side-constraint is fulfilled.

We first check whether the mean value on the whole interval satisfies the multiscale side-constraint. If , then . Otherwise, we randomly choose and from

 \left\{ (i,j) : t_{[0,1)}\!\left(\left[\tfrac{i}{n}, \tfrac{j}{n}\right]\right) := \frac{\bigl|\sum_{l=i}^{j} (Y_l - \bar{Y})\bigr|}{\sigma\sqrt{j-i+1}} - \mathrm{pen}\!\left(\tfrac{j-i+1}{n}\right) - q_{\alpha}(n) > 0 \right\}, \qquad (26)

according to any distribution that is independent of the values ’s. Then we check the intervals , and individually, and split them further in the same manner if necessary. This procedure is repeated until the mean value on each resulting interval satisfies the multiscale side-constraint, i.e. . Finally, .

Let $D_m$ denote the number of change-points (discoveries) and $S_m$ the number of segments introduced in the $m$-th step. We make the convention that $D_m = S_m = 0$ if the procedure stops before the $m$-th step. It follows from , cf. (11), (recall here) that

 \mathbb{E}[D_1] \le 2\alpha \quad \text{and} \quad \mathbb{E}[S_1] \le 3\alpha.
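Anticipating the rest of the argument, the same reasoning applies conditionally at each later step: if every segment produced in one step is split further with conditional probability at most $\alpha$ (as shown below for the three subintervals of the first split), each split contributing in expectation at most $2$ discoveries and $3$ segments, one obtains $\mathbb{E}[D_m] \le 2\alpha(3\alpha)^{m-1}$, and summing the resulting geometric series gives a sketch of the final bound:

```latex
\mathbb{E}\bigl[\mathrm{FD}(\alpha) \,\big|\, \mathrm{TD}(\alpha) = 0\bigr]
  \;\le\; \sum_{m \ge 1} \mathbb{E}[D_m]
  \;\le\; \sum_{m \ge 1} 2\alpha\,(3\alpha)^{m-1}
  \;=\; \frac{2\alpha}{1 - 3\alpha}
  \;=\; G(\alpha),
```

which is finite precisely because $\alpha < 1/3$.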

Now we consider the three intervals $I_1$, $I_2$ and $I_3$, and bound the probability of splitting them further into smaller intervals. It will be shown that

 \mathbb{P}\left\{ T_{I_k}\bigl(Y, \bar{Y}_{I_k}\bigr) > q_{\alpha}(\#I_k) \,\middle|\, T_{[0,1)}\bigl(Y, \bar{Y}\bigr) > q_{\alpha}(n) \right\} \le \alpha \quad \text{for } k = 1, 2, 3.

Given , the random variable depends only on , , which is independent of and . It follows from (26) that depends only on and . Thus is independent of conditioned on .