Heterogeneous Change Point Inference
Abstract.
We propose H-SMUCE (heterogeneous simultaneous multiscale changepoint estimator) for the detection of multiple changepoints of the signal in a heterogeneous Gaussian regression model. A piecewise constant function is estimated by minimizing the number of changepoints over the acceptance region of a multiscale test which locally adapts to changes in the variance. The multiscale test is a combination of local likelihood ratio tests which are properly calibrated by scale dependent critical values in order to keep a global nominal level $\alpha$, even for finite samples.
We show that H-SMUCE controls the error of over- and underestimation of the number of changepoints. To this end, new deviation bounds for F-type statistics are derived. Moreover, we obtain confidence sets for the whole signal. All results are nonasymptotic and uniform over a large class of heterogeneous changepoint models. H-SMUCE is fast to compute, achieves the optimal detection rate and estimates the number of changepoints at almost optimal accuracy for vanishing signals, while still being robust.
We compare H-SMUCE with several state-of-the-art methods in simulations and analyse current recordings of a transmembrane protein in the bacterial outer membrane with pronounced heterogeneity for its states. An R-package is available online.
Key words and phrases:
changepoint regression, deviation bounds, dynamic programming, heterogeneous noise, honest confidence sets, ion channel recordings, multiscale methods, robustness, scale dependent critical values
2010 Mathematics Subject Classification: 62G08, 62G15, 90C39
1. Introduction
1.1. Changepoint regression
Multiple changepoint detection is a long-standing task in statistical research and related areas. One of the most fundamental models in this context is (homogeneous) Gaussian changepoint regression
(1.1) \[ Y_i = f(x_i) + \sigma \varepsilon_i, \qquad i = 1, \ldots, n. \]
Here, $Y_1, \ldots, Y_n$ denote the observations, $f$ is an unknown piecewise constant mean function, $\sigma^2$ a constant (homogeneous) variance and $\varepsilon_1, \ldots, \varepsilon_n$ are independent standard Gaussian distributed errors. For simplicity, we restrict ourselves in this paper to an equidistant sampling scheme $x_i = i/n$, but extensions to other designs are straightforward. Methods for estimating the changepoints in (1.1) and in related models are vast, see for instance (Yao, 1988; Donoho and Johnstone, 1994; Csörgo and Horváth, 1997; Bai and Perron, 1998; Braun et al., 2000; Birgé and Massart, 2001; Kolaczyk and Nowak, 2005; Boysen et al., 2009; Harchaoui and Lévy-Leduc, 2010; Jeng et al., 2010; Killick et al., 2012; Rigollet and Tsybakov, 2012; Zhang and Siegmund, 2012; Fryzlewicz, 2014) and the references in these papers.
A crucial condition in most of the aforementioned papers is the assumption of homogeneous noise, i.e. a constant variance in (1.1). In many applications, however, this assumption is violated and the variance varies over time, $\sigma^2 = \sigma^2(x)$, say. This problem arises for instance in the analysis of array CGH data, see (Muggeo and Adelfio, 2011; Arlot and Celisse, 2011). Further examples include economic applications; e.g. the real interest rate is modelled by Bai and Perron (2003) as a piecewise linear regression with covariates and heterogeneous noise. In this paper we will discuss an example from membrane biophysics, the recordings of ion channels, see Section 5. It is well known that the noise of the open state can be much larger than the background noise, see (Sakmann and Neher, 1995, Section 3.4.4) and the references therein, rendering the different states a potential source of variance heterogeneity.
To illustrate the effects of ignoring heterogeneity we show in Figure 1 a reconstruction by SMUCE (Frick et al., 2014; http://cran.r-project.org/web/packages/stepR, v. 1.0-3, 2015-06-18), a method that has been designed for homogeneous noise.
The constant variance assumption of SMUCE leads to an overestimation of the standard deviation (which is pre-estimated by a global estimator) in the first half and an underestimation in the second half. Therefore, in Figure 1 SMUCE misses the first changepoint and includes artificial changepoints in the second half to compensate for the too small variance it is forced to use, see also (Zhou, 2014). Note that this flaw is not a particular feature of SMUCE; it will occur for any sensible segmentation method which relies on a constant variance assumption. Hence, from Figure 1 the fundamental difficulty of the heterogeneous (multiscale) changepoint regression problem becomes apparent: how to decide whether changed fluctuations of the data result from high-frequency changes in the mean or merely from an increase of the noise level? If changes can occur on any scale (i.e. the length of an interval of neighbouring observations) this is a notoriously difficult issue and proper separation of signal and noise cannot be performed without extra information.
Indeed, the basis of the presented theory is the observation that it is often reasonable to exclude changes of the variance within constant segments of the signal (see Section 1.2). Under this relatively weak assumption, we show in this paper that estimation of the signal for heterogeneous data in a multiscale fashion becomes indeed feasible. In addition, we also aim for a method which is robust when changes in the variance occur at locations where the signal is constant, as we believe that this cannot be excluded in many practical cases. To this end, we introduce a new estimator, H-SMUCE (heterogeneous simultaneous multiscale changepoint estimator), which recovers the signal under heterogeneous noise over a broad range of scales, controls the familywise error rate of overestimating the number of changepoints, allows for confidence statements for all unknown quantities, obeys certain statistical optimality properties, and can be computed efficiently. At the same time it is robust against heterogeneous noise on constant signal segments, which as a byproduct also makes it robust against heavier-tailed errors.
1.2. Heterogeneous changepoint model
To be more specific, from now on we consider the heterogeneous Gaussian changepoint model
(1.2) \[ Y_i = f(x_i) + \sigma(x_i)\, \varepsilon_i, \qquad i = 1, \ldots, n, \]
where now the standard deviation $\sigma(\cdot)$ is also given by an unknown piecewise constant function. For the following theoretical results we assume that it can only have changepoints at the same locations as the mean function $f$. In other words, $(f, \sigma)$ is a pair of unknown piecewise constant functions in
(1.3) \[ \Big\{ (f, \sigma) : f = \sum_{k=0}^{K} \theta_k \mathbf{1}_{[\tau_k, \tau_{k+1})}, \ \sigma = \sum_{k=0}^{K} \sigma_k \mathbf{1}_{[\tau_k, \tau_{k+1})}, \ \sigma_k > 0 \Big\}, \]
with unknown changepoint locations $0 = \tau_0 < \tau_1 < \cdots < \tau_K < \tau_{K+1} = 1$ for some unknown number $K$ of changepoints and also unknown function values $\theta_k$ and $\sigma_k$ of $f$ and $\sigma$. For technical reasons, we define $f(1)$ and $\sigma(1)$ by continuous extension of $f$ and $\sigma$, respectively. For identifiability we assume $\theta_k \neq \theta_{k+1}$ and exclude isolated changes in the signal by assuming that $f$ is a right continuous function. It is important to stress that in (1.3) we allow the variance to potentially have changes at the locations of the changes of the signal, but the variance need not necessarily change when $f$ changes, as we do not assume $\sigma_k \neq \sigma_{k+1}$. In particular, homogeneous observations are still part of the model. The other way around, we assume that within a constant segment of $f$ the variance may not change, i.e. the local signal-to-noise ratio is constant on $[\tau_k, \tau_{k+1})$ for all $k$. We argue that this is a reasonable assumption in many applications (recall the examples given above and see our data example in Section 5), since a changepoint typically represents a change of the condition of the underlying state. Moreover, for example, in many engineering applications a locally constant signal-to-noise ratio is assumed (Guillaume et al., 1990), which motivates our modelling as well. However, we stress that the restriction to model (1.3) is only required for our theory. For the practical application we will show in simulations in Section 4.2 that H-SMUCE is in addition robust against a violation of this assumption (i.e. when a variance change may occur without a signal change) and hence still works well in the general heterogeneous changepoint model (1.2) with arbitrary variance changes.
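For concreteness, the following sketch simulates observations from the heterogeneous model (1.2) with $(f, \sigma)$ piecewise constant on the same segments, as in (1.3); the function name and the segment parametrisation are our own illustration, not part of the paper.

```python
import numpy as np

def simulate_hetero_cp(n, changepoints, means, sds, seed=0):
    """Simulate Y_i = f(x_i) + sigma(x_i) * eps_i on the equidistant
    design x_i = i/n, where f and sigma are piecewise constant and,
    as in model (1.3), change only at the same locations."""
    rng = np.random.default_rng(seed)
    bounds = np.concatenate(([0.0], np.asarray(changepoints, float), [1.0]))
    x = np.arange(1, n + 1) / n
    # right-continuous segments [tau_k, tau_{k+1})
    seg = np.clip(np.searchsorted(bounds, x, side="right") - 1,
                  0, len(means) - 1)
    f = np.asarray(means, float)[seg]
    sigma = np.asarray(sds, float)[seg]
    return x, f + sigma * rng.standard_normal(n), f, sigma

# two changepoints at 0.3 and 0.7; the noise level changes with the signal
x, y, f, sigma = simulate_hetero_cp(
    n=500, changepoints=[0.3, 0.7],
    means=[0.0, 2.0, 0.5], sds=[0.2, 1.0, 0.4])
```

Note that the variance jumps exactly at the signal changepoints here; a robustness check against variance changes on constant signal segments (as in Section 4.2) would simply pass different changepoint locations for `sds`.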
1.3. Heterogeneous changepoint regression
To the best of our knowledge there are only a few methods which explicitly take into account the heterogeneity of the noise in changepoint regression, either in the model considered here or in related models. Here, we have to distinguish two settings.
In the first setting, changes in the variance are also considered as relevant structural changes of the underlying data (even when the mean does not change), and one seeks changes in the mean and in the variance, respectively. In this spirit are local search methods, such as binary segmentation (Scott and Knott, 1974; Vostrikova, 1981) (if the corresponding single changepoint detection method takes the heterogeneous variance into account), but global methods can achieve this goal as well, e.g. (Killick et al., 2012). For a Bayesian approach in this context see (Du et al., 2015) and the references therein. In addition, methods which search for more general structural changes in the distribution potentially apply to this setup as well, see e.g. (Csörgo and Horváth, 1997; Arlot et al., 2012; Matteson and James, 2014).
This is in contrast to the setting we address in this paper: the variance is considered as a nuisance parameter and we primarily seek changes in the signal $f$. Hence, we aim for statistically efficient estimation of the mean function, while still being robust against heterogeneous noise. Obviously, this cannot be achieved by methods addressing the first setting. Although of great practical relevance, this situation has only rarely been considered and in particular, to our knowledge, no theory exists. The cross-validation method of Arlot and Celisse (2011) and the method of Muggeo and Adelfio (2011) have been designed specifically to be robust against heterogeneous noise. Moreover, circular binary segmentation (CBS), see (Venkatraman and Olshen, 2007), also applies to this setting.
For a better understanding of the problem considered here it is illustrative to distinguish our setting further, namely from the case when it is known beforehand that changes in the variance will necessarily occur with changes in the signal. This will potentially increase the detection power, as under this assumption variance changes can be used for finding signal changes as well. The information gain due to the variance changes in this case has been recently quantified by Enikeeva et al. (2015) in terms of the minimax detection boundary for single vanishing signal bumps. More precisely, if the baseline variance and the variance at the bump are given, the constant in the minimax detection boundary can be computed explicitly, see (Enikeeva et al., 2015, Theorems 3.1-3.3). For the particular case of homogeneous variance the factor becomes maximal, see also (Dümbgen and Walther, 2008; Frick et al., 2014). This reflects that no additional information on the location of a change can be gained from the variance in the homogeneous case. Comparing this to the inhomogeneous case we see that when the variance change is known to be large enough, additional information for the signal change can be gained from the variance change, provided it is known that signal and variance change simultaneously.
In contrast, in the present setting the variance need not necessarily change when the signal changes; hence the "worst case" of no variance change from above is contained in our model, which lower bounds the detection boundary. The situation is further complicated by the fact that missing knowledge of a variance change can potentially even have an adverse effect, because in model (1.3) detection power will potentially be decreased further as the nuisance parameter hinders estimation of the changepoints of $f$. For this situation the optimal minimax constants are unknown to us, but from the fact that the model with a constant variance is a submodel of our model (1.3) it immediately follows that the minimax constant for a single bump can only be larger. This will allow us to show that H-SMUCE attains the same optimal minimax detection rate as in the homogeneous case, albeit with a larger constant appearing in the minimax detection boundary. Remarkably, the only extra assumption we have to impose is that signal and variance have to be constant on segments of a certain minimal length, see Theorem 3.10. This reflects the additional difficulty of separating "locally" signal and noise levels in a multiscale fashion. In other words, when the number of i.i.d. neighbouring observations (no change in signal and variance) in each segment is large enough, separation of signal and noise will be done by H-SMUCE in an optimal way (possibly up to a constant).
1.4. Heterogeneous changepoint inference
We define H-SMUCE as the multiscale constrained maximum likelihood estimator restricted to all solutions of the following optimisation problem
(1.4) \[ \min_{f \in \mathcal{S}} \#J(f) \quad \text{subject to} \quad T_I(Y, f) \le q_{|I|} \ \text{for all } I \in \mathcal{D} \text{ on which } f \text{ is constant}, \]
see also (Boysen et al., 2009; Davies et al., 2012; Frick et al., 2014) for related approaches. Here, $\mathcal{S}$ is the set of all piecewise constant mean functions, $\#J(f)$ the cardinality of the set $J(f)$ of changepoints of $f$, and the right hand side of (1.4) a multiscale constraint to be explained now. Given a candidate function $f$, this tests simultaneously, over the system of all intervals $I$ on which $f$ is constant, whether its function value is the mean value of the observations on the respective interval $I$. In order to perform each test, i.e. to decide whether the observations have constant mean $m_0$ on $I$, the local log-likelihood-ratio statistic
(1.5) \[ T_I(Y, m_0) := \frac{|I|}{2} \log\Big( 1 + \frac{(\overline{Y}_I - m_0)^2}{\hat{\sigma}_I^2} \Big), \]
with $\overline{Y}_I := |I|^{-1} \sum_{i \in I} Y_i$ and local variance estimate $\hat{\sigma}_I^2 := |I|^{-1} \sum_{i \in I} (Y_i - \overline{Y}_I)^2$, is compared with a local threshold in a multiscale fashion, to be discussed now.
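As an illustration of the local test, the following sketch computes a statistic of the form (1.5). The closed form below follows from maximising the Gaussian likelihood with the variance as an unknown nuisance parameter under hypothesis and alternative; the function name is ours and the exact normalisation used in the paper may differ.

```python
import numpy as np

def local_lr_stat(y_seg, m0):
    """Local log-likelihood-ratio statistic for testing that the
    observations on an interval I have constant mean m0, with the
    variance an unknown nuisance parameter.  Maximising the Gaussian
    likelihood under hypothesis and alternative yields
        T_I = (|I| / 2) * log(1 + (ybar_I - m0)^2 / sigma_hat_I^2),
    with sigma_hat_I^2 the local maximum-likelihood variance estimate."""
    y_seg = np.asarray(y_seg, dtype=float)
    ybar = y_seg.mean()
    s2 = np.mean((y_seg - ybar) ** 2)   # local ML variance estimate
    return 0.5 * len(y_seg) * np.log1p((ybar - m0) ** 2 / s2)
```

The statistic is zero when $m_0$ equals the local sample mean and grows monotonically in the standardised distance between them, which is what the multiscale constraint thresholds scale by scale.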
In what follows, we restrict the multiscale test to intervals in the dyadic partition
(1.6) \[ \mathcal{D} := \bigcup_{l=1}^{L} \mathcal{D}_l, \]
where $L$ denotes the number of different scales and
(1.7) \[ \mathcal{D}_l := \Big\{ \{ (i-1) 2^{l} + 1, \ldots, i\, 2^{l} \} : i = 1, \ldots, \lfloor n / 2^{l} \rfloor \Big\} \]
the set of intervals from the dyadic partition with length $2^l$. This allows fast computation and simplifies the asymptotic analysis. Nevertheless, our methodology can be adapted to other interval systems, see Remark 2.2.
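One plausible indexing of such a dyadic partition is sketched below; the convention (disjoint blocks of length $2^l$ at scale $l$, starting with pairs so that the local variance estimate is always well defined) is our own and the paper's exact index set may differ. Since each scale is a partition, the total number of intervals is of order $n$, which is behind the fast computation mentioned above.

```python
import math

def dyadic_partition(n):
    """Return, for each scale l = 1, ..., L = floor(log2(n)), the
    disjoint blocks of length 2**l covering {0, ..., n-1}; a trailing
    incomplete block is dropped.  Each block is a (start, end) pair
    of inclusive indices."""
    L = int(math.floor(math.log2(n)))
    scales = []
    for l in range(1, L + 1):
        length = 2 ** l
        scales.append([(a, a + length - 1)
                       for a in range(0, n - length + 1, length)])
    return scales

# n = 8 gives three scales, with blocks of length 2, 4 and 8
print(dyadic_partition(8))
```

Each scale contributes at most $n / 2^l$ intervals, so the whole system has fewer than $n$ intervals in total, compared to roughly $n^2/2$ for the system of all intervals.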
It remains to determine thresholds $q_1, \ldots, q_L$ for the intervals in (1.6) that combine the local tests appropriately. To this end, note that logarithmic (or related) scale penalisation as in the homogeneous case (Dümbgen and Spokoiny, 2001; Dümbgen and Walther, 2008; Frick et al., 2014) no longer balances scales appropriately in the heterogeneous case. In particular, it would give a multiscale statistic which diverges, since due to the local variance estimation the test statistic fails to have sub-Gaussian (but still has sub-exponential) tails. To overcome this burden we introduce in Section 2 scale dependent critical values such that the multiscale test has global significance level $\alpha$, see (2.3). To this end, the different scales are balanced appropriately by weights $\beta_1, \ldots, \beta_L$ with $\sum_l \beta_l = 1$, see (2.4) and (2.5). More precisely, these weights determine the ratios between the rejection probabilities of the multiscale test on the corresponding scales. Existence and uniqueness of the so defined scale dependent critical values is shown in Lemma 2.1 and explicit bounds are given in Lemma 3.1. The weights also allow one to incorporate prior scale information, see Section 3.4.
Using the so obtained thresholds allows us to derive several confidence statements, which are a main feature of H-SMUCE. First of all, we show in Section 3 that the probability to overestimate the number of changepoints is bounded by the significance level $\alpha$, uniformly over the model (1.3), see Theorem 3.3. More specifically, we show the overestimation bound
(1.8) 
see Theorem 3.4. In Theorem 3.5 we provide an exponential bound for the probability that H-SMUCE underestimates the number of changepoints. To this end, we show new exponential deviation bounds for F-type statistics (Section C.3), which might be of interest on their own. Combining the over- and the underestimation bound provides upper bounds for the probability and the expectation of misestimating the number of changepoints. For a fixed signal both bounds vanish superpolynomially in $n$ when the significance levels and the weights are chosen appropriately, see Remark 3.6. Consequently, the estimated number of changepoints converges almost surely to the true number, see Theorem 3.7. Further, these exponential bounds enable us to obtain a confidence band for the signal as well as confidence intervals for the locations of the changepoints; for an illustration see Figures 1 and 2.
We show that the diameters of these confidence intervals decrease asymptotically as fast as the (optimal) sampling rate up to a log factor. All confidence statements hold uniformly over all functions with a given minimal signal-to-noise ratio and minimal scale, arbitrary, but fixed, see Theorems 3.8 and 3.9.
1.5. H-SMUCE in action
Figure 2 illustrates the performance of H-SMUCE in an example with heterogeneous observations and several changepoints. We found that for small $\alpha$ H-SMUCE misses one changepoint (as this choice tunes H-SMUCE to provide the strong guarantee not to overestimate the number of changepoints, see (1.8)), whereas for larger $\alpha$ (only two values displayed) the correct number of changepoints is always detected (while providing a weaker guarantee for not overestimating). In addition, for these choices each true changepoint is covered by the associated confidence interval. This illustrates the influence of the significance level $\alpha$. Notably, we find that the reconstructions are remarkably stable in $\alpha$. In fact, combining Lemma 3.1 and (A.1) shows that the width of the confidence band decreases only logarithmically for increasing $\alpha$.
We compare the performance of H-SMUCE with CBS (Venkatraman and Olshen, 2007), cumSeg (Muggeo and Adelfio, 2011) and the cross-validation method of Arlot and Celisse (2011) in several simulation studies in Section 4 (see also Figure 9 in Supplement B for their performance on the data in Figure 2), where we also examine robustness issues, see Section 4.2. In all of these simulations and in the subsequent application H-SMUCE performs very robustly and only rarely includes too many changepoints, in accordance with (1.8).
In Section 5 we apply H-SMUCE to current recordings of a transmembrane protein with pronounced heterogeneity for its states. In contrast to segmentation methods which rely on homogeneous noise, we found that H-SMUCE provides a reasonable reconstruction, where all visible gating events are detected.
Finally, we stress that the confidence band and the confidence intervals for the changepoint locations provided by H-SMUCE can be used to accompany any segmentation method to assess the significance of its estimated changepoints. This is illustrated in Section 5 as well.
Computation of the estimator by a pruned dynamic program and of the critical values based on Monte-Carlo simulation is explained carefully in Supplement A. There we also study the theoretical and empirical computation time of H-SMUCE. Due to the underlying dyadic partition the computation of H-SMUCE is very fast, in some scenarios even linear in the number of observations. Additional simulation results are collected in Supplement B and all proofs are given together with some auxiliary statements in Supplement C. An R-package is available online (http://www.stochastik.math.uni-goettingen.de/hsmuce).
2. Scale dependent critical values
For the definition of H-SMUCE it remains to determine the local thresholds in (1.4). First of all, the multiscale test on the right hand side of (1.4) should be a level $\alpha$ test, i.e.
(2.1) \[ \sup_{(f, \sigma)} \mathbb{P}_{(f,\sigma)}\Big( \exists\, I \in \mathcal{D} \text{ on which } f \text{ is constant}: \ T_I(Y, f) > q_{|I|} \Big) \le \alpha. \]
Here, we use the same threshold for all intervals of the same length, as no a-priori information on the changepoint locations is assumed. More precisely, as we have restricted the multiscale test to the dyadic partition in (1.6), we aim to find a vector of critical values $q = (q_1, \ldots, q_L)$, one for each scale. To this end, w.l.o.g., we may consider standard Gaussian observations, since the supremum in (2.1) is attained at $f \equiv 0$ and $\sigma \equiv 1$, see the proof of Theorem 3.3. We then define the statistics $T_l$, with $\mathcal{D}_l$ in (1.7), as
(2.2) \[ T_l := \max_{I \in \mathcal{D}_l} T_I(Y, 0), \qquad l = 1, \ldots, L. \]
Then, the critical values fulfil (2.1) if
(2.3) \[ F(q_1, \ldots, q_L) \ge 1 - \alpha, \]
with $F$ the cumulative distribution function of $(T_1, \ldots, T_L)$.
As the critical values are not uniquely determined by (2.3) they can be chosen to render the multiscale test particularly powerful for certain scales. To this end, we introduce weights
(2.4) \[ \beta_1, \ldots, \beta_L \ge 0 \quad \text{with} \quad \sum_{l=1}^{L} \beta_l = 1, \]
where $\beta_l = 0$ means to omit the $l$-th scale, i.e. $q_l = \infty$. Finally, we define the critical values implicitly through
(2.5) \[ \frac{1 - F_l(q_l)}{\beta_l} = \frac{1 - F_{l'}(q_{l'})}{\beta_{l'}} \quad \text{for all } l, l' \text{ with } \beta_l, \beta_{l'} > 0, \]
with $F_l$ the cumulative distribution function of $T_l$. If $\beta_l = 0$ the corresponding scale will not enter the system of equations in (2.5). The weights determine the ratios between the probabilities that a test on a certain scale rejects, and hence regulate the allocation of the level $\alpha$ among the single scales. In summary, the choice of the local thresholds boils down to choosing the significance level $\alpha$ and the weights $\beta_1, \ldots, \beta_L$; we discuss these choices more carefully in Section 3.4. If no prior information on scales is available, a default option is to set all weights equal, i.e. $\beta_l = 1/L$.
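The Monte-Carlo determination of such critical values can be sketched as follows. The code below uses a deliberately simplified, Bonferroni-type allocation: it draws standard Gaussian data, evaluates the per-scale maximum statistics as in (2.2) on disjoint dyadic blocks, and takes marginal $(1 - \alpha\beta_l)$ quantiles with equal weights. The exact scheme (2.3)-(2.5) instead adjusts the quantiles jointly so that the global level is exactly $\alpha$; this stand-in is therefore conservative, and all names are our own.

```python
import numpy as np

def mc_critical_values(n, alpha=0.1, n_sim=200, seed=1):
    """Monte-Carlo sketch of scale-dependent critical values: for each
    scale l, the (1 - alpha * beta_l) marginal quantile of the maximum
    of the local likelihood-ratio statistics over disjoint blocks of
    length 2**l, with equal weights beta_l = 1/L.  This is a
    conservative, Bonferroni-type stand-in for the joint calibration
    (2.3)-(2.5)."""
    def t_stat(seg):
        # local LR statistic for mean 0 with unknown variance
        ybar = seg.mean()
        s2 = np.mean((seg - ybar) ** 2)
        return 0.5 * len(seg) * np.log1p(ybar ** 2 / s2)

    rng = np.random.default_rng(seed)
    lengths = [2 ** l for l in range(1, int(np.floor(np.log2(n))) + 1)]
    L = len(lengths)
    stats = np.empty((n_sim, L))
    for r in range(n_sim):
        y = rng.standard_normal(n)
        for l, m in enumerate(lengths):
            stats[r, l] = max(t_stat(y[a:a + m])
                              for a in range(0, n - m + 1, m))
    beta = np.full(L, 1.0 / L)  # equal weights, the default choice
    return [float(np.quantile(stats[:, l], 1 - alpha * beta[l]))
            for l in range(L)]
```

Critical values obtained this way are universal in the sense noted below: they depend only on $n$, $\alpha$ and the weights, not on the data, and can therefore be tabulated once.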
The next result shows that the vector of critical values satisfying (2.3)-(2.5) is always well-defined.
Lemma 2.1 (Existence and uniqueness).
An explicit computation of the vector q appears to be very hard, since the statistics $T_1, \ldots, T_L$ are dependent, although the dependence structure is explicitly known. Alternatively, it would be helpful to have an approximation for the distribution (and hence the quantiles) of the maximum in (2.1), which, however, appears to be rather difficult as well; for a related setting (which does not apply here), see Davies (1987, 2002). Therefore, we determine in Section A.2 the vector q by Monte-Carlo simulations. Note that the distribution does not depend on the specific element of the model and hence the critical values can be computed in a universal manner. We stress that the determination of the scale dependent critical values is not restricted to our setting and can also be applied to multiscale testing in other contexts. In contrast to scale penalisation, and like the block criterion in (Rufibach and Walther, 2010), no model dependent derivations are required and the critical values are adapted to the exact finite sample distribution of the local test statistics. Moreover, our approach additionally allows a flexible scale calibration by the choice of the weights (see Section 3.4) and arbitrary interval sets can be used, as the following remark points out.
Remark 2.2 (Other interval sets).
H-SMUCE can be easily adjusted to other interval sets as follows. Let $\mathcal{I}$ be an arbitrary set of intervals. Then, in the definitions in (3.3) and (3.7) we replace the dyadic partition by $\mathcal{I}$ and the vector of critical values by the corresponding vector for $\mathcal{I}$ (empty scales should be omitted) in Section 2, with
(2.6) 
Again it remains to choose the significance level $\alpha$ and the weights to determine the critical values required for H-SMUCE. Note, however, that the critical values and their bounds in Lemma 3.1, and therefore the results in Section 3 (apart from Theorems 3.3 and 3.4), will depend on the specific interval system and have to be computed for each system separately.
Employing a larger interval set than the dyadic partition may lead to better detection power, but at the price of a larger computation time. Hence, in practice, a trade-off between computational and statistical efficiency may guide this choice as well. Our R-package includes besides the dyadic partition also the system of all intervals (of order $n^2$ intervals, statistically most efficient, but computationally expensive) and the system of all intervals of dyadic length (of order $n \log n$ intervals, intermediate efficiency and computation time). Interesting choices might also be approximating sets like those introduced in (Walther, 2010; Rivera and Walther, 2013), which are larger than the dyadic partition, but achieve the minimax boundary in the context of density estimation.
3. Theory
In this section we collect our theoretical results. We start with finite sample bounds for the critical values. These allow us to bound the probabilities of over- and underestimating the number of changepoints. With these bounds we obtain confidence statements for the signal and its main characteristics. Finally, we investigate asymptotic detection rates of H-SMUCE for vanishing signals.
3.1. Finite bounds for over- and underestimation
In the following we require upper bounds for the critical values, since their definition by the equations (2.3)-(2.5) is only implicit.
Lemma 3.1 (Bound on critical values).
Remark 3.2.
The following theorem shows that the significance level $\alpha$ controls the probability of overestimating the number of changepoints.
Theorem 3.3 (Overestimation control I).
Assume the heterogeneous Gaussian changepoint model (1.2). Let $K$ be the number of changepoints of the signal $f$. Let further $\hat{K}$ be the number of changepoints estimated by H-SMUCE, i.e.
(3.3) 
Then, for any vector of critical values q with significance level $\alpha$ and weights as in (2.3)-(2.5), uniformly over the model (1.3) it holds that \[ \mathbb{P}\big( \hat{K} > K \big) \le \alpha. \]
The theorem gives us a direct interpretation of the parameter $\alpha$ as (a bound on) the probability of overestimating the number of changepoints. This even holds locally, i.e. on every union of adjoining segments of the estimator, with probability at least $1 - \alpha$ there are at least as many changepoints as detected. Moreover, we strengthen the result by showing that the probability to estimate additional changes decays exponentially fast and hence the expected overestimation is small.
Theorem 3.4 (Overestimation control II).
To control the probability of misestimating $K$ we additionally need an upper bound for the probability of underestimating $K$. Unlike the overestimation bounds in Theorems 3.3 and 3.4, the probability of underestimation cannot be bounded uniformly over the whole model (1.3), since the size and scale of changes could be arbitrarily small. This is made more precise in Theorem 3.10, which gives the detection boundary in terms of the smallest (standardized) jump size and the smallest scale. The next theorem provides an exponential bound uniformly over the subset
(3.4) 
with minimal jump size and minimal scale arbitrary, but fixed.
Theorem 3.5 (Underestimation control).
Roughly speaking, H-SMUCE detects any changepoint of the signal under the assumptions of Theorem 3.5 at least with the probability given there. A sharper version with different probabilities is given in Theorem C.5 in the supplement. Such a result clarifies the dependence on the different weights, but is technically much more involved. Combining Theorems 3.3, 3.4 and 3.5 gives upper bounds for the probability and the expectation that H-SMUCE misspecifies the number of changepoints.
Remark 3.6 (Vanishing errors).
For a fixed signal (fixed minimal jump size and scale are sufficient) both errors vanish asymptotically if the significance levels $\alpha_n \to 0$ are chosen to decay slowly enough, with a triangular scheme for the weights in (2.5). We can achieve a rate arbitrarily close to an exponential rate by letting $\alpha_n$ vanish arbitrarily slowly. The condition on the sequence of levels allows a variety of possible choices of the weights, too. For instance, the choice of equal weights, which weights all scales equally, fulfils this condition.
A direct consequence is the strong model consistency of H-SMUCE.
Theorem 3.7 (Strong model consistency).
Assume the setting of Theorem 3.3 and let $\hat{K}_n$ be the sequence of numbers of changepoints estimated by H-SMUCE with significance levels $\alpha_n$ and corresponding weights. Moreover, consider the submodel in (3.4) with minimal jump size and scale arbitrary, but fixed. If
(3.6) 
holds, then $\hat{K}_n \to K$ almost surely, uniformly over the submodel.
3.2. Confidence sets
In this section we obtain confidence sets for the signal and for the locations of the changepoints. First, we show that the set of all solutions of (1.4)
(3.7) 
is a confidence set for the unknown signal $f$.
Theorem 3.8 (Confidence set).
This shows that the asymptotic coverage of this confidence set is at least $1 - \alpha$. Lemma C.6 gives an exponential inequality similar to (3.5) which shows that it is also a nonasymptotic confidence set. We further derive from this set confidence intervals for the changepoint locations.
Theorem 3.9 (Changepoint locations).
Assume the setting of Theorem 3.8, where $\alpha$ is replaced by a sequence $\alpha_n$, and let the tuning parameters be such that
(3.9) 
Then,
(3.10) 
Here, the rate is equal to the sampling rate up to a (logarithmic) factor depending on the tuning parameters. A nonasymptotic statement is given in Lemma C.7 in the supplement. For visualization of the confidence statements it is useful to further derive a confidence band for the signal as in (Frick et al., 2014, Corollary 3 and the explanation around it). It can be shown that the collection of confidence intervals for the changepoint locations according to Theorem 3.9 satisfies (3.8) as well. Recall Figures 1 and 2 for an illustration. It is also possible to strengthen the statements of this section to sequences of signals which vanish slowly enough, but we omit such results.
3.3. Asymptotic detection rates for vanishing signals
For the detection of a single vanishing bump against a noisy background see Theorem C.8 in the supplement. The following theorem deals with the detection of a signal with several vanishing changepoints.
Theorem 3.10 (Multiple vanishing changepoints).
Assume the heterogeneous Gaussian changepoint model (1.2). Let $K_n$ be the sequence of true numbers of changepoints. Let further $\hat{K}_n$ be the sequence of numbers of changepoints estimated by (3.3), with significance levels $\alpha_n$ and weights $\beta_n$. Let a sequence of submodels as in (3.4) be given. We further assume
(3.11) 
as well as

(i) for large scales, i.e. , the limit ,
(ii) for small scales, i.e. , the inequality
(3.12) with possibly , but such that and
with for bounded and for unbounded.
Then,
Theorems C.8 and 3.10 state conditions on the tuning parameters $\alpha_n$ and $\beta_n$ as well as on the length of the minimal scale and the standardized jump size needed to detect the vanishing signals uniformly over the submodels. If, in addition, $\alpha_n \to 0$ holds, then we also control the probability of overestimating the number of changepoints and therefore the estimation of the number of changepoints is still consistent in the case of a vanishing signal. The main condition in both theorems is a lower bound on the standardized jump size, see (C.9) and (3.12). This is optimal in the sense that no signal with a smaller rate can be detected asymptotically with probability one, see (Dümbgen and Spokoiny, 2001; Chan and Walther, 2013; Frick et al., 2014) for the case of homogeneous observations, and note that this is a submodel of our model. But in contrast to the homogeneous case we need, in addition, a lower bound on the minimal scale, see (C.10) and (3.11). Such a restriction appears reasonable, since for the additional variance estimation only the number of observations on the segment is relevant and not the size of the change. Finally, we observe that the constants encountered in the lower detection bound in (C.9) and (3.12) increase with the difficulty of the estimation problem, where the difficulty is represented by the number of vanishing segments. All of these constants are a little larger than the analogous constants for the homogeneous case in (Frick et al., 2014, Theorems 5 and 6), reflecting the additional difficulty caused by the heterogeneous noise; this holds for one vanishing segment, for a bounded number of vanishing segments and for an unbounded number of vanishing segments alike. Note again that the optimal constants for the heterogeneous case are unknown to us.
3.4. Choice of the tuning parameters
In this section we discuss the choice of the tuning parameters $\alpha$ and $\beta_1, \ldots, \beta_L$.
Choice of the significance level
As illustrated in Figure 2, the choice of $\alpha$ depends on the application. If a strict overestimation control of the number of changepoints is desirable, $\alpha$ should be chosen small, recall Theorems 3.3 and 3.4. This might come at the expense of missing changepoints, but with large probability not too many are detected (recall Figure 2 and see also the simulations in Section 4). If changepoint screening is the primary goal, i.e. we aim to avoid missing changepoints, $\alpha$ should be increased, since Theorem 3.5 shows that the error probability of underestimating the number of changepoints decreases with increasing $\alpha$. If model selection, i.e. estimation of $K$, is the major aim, an intermediate level that balances the over- and underestimation errors should be chosen. Both errors vanish superpolynomially for the asymptotic choice of $\alpha_n$ in Remark 3.6. A finite sample approach is to weight these error probabilities and to choose $\alpha$ such that the resulting upper bound
is minimized. This also allows one to incorporate prior information on the number of changepoints. Alternatively, the bound on the expectation obtained by combining Theorems 3.4 and 3.5 can be minimized to take the size of the misestimation into account. Despite all possibilities to choose the 'best' $\alpha$ for a given application, comparing estimates at different $\alpha$ can be helpful to trace the "stability of evidence" of the estimated changepoints at different significance levels. Of course, such a "significance screening" does not allow for a frequentist interpretation of a significance level anymore, as $\alpha$ has to be fixed in advance, see e.g. (Schervish, 1996). Nevertheless, it might for instance give some indication whether to perform further experiments. Beyond this, for a fixed $\alpha$ the confidence statements of H-SMUCE can also be used to support findings by other estimators. This is illustrated in Section 5 for the ion channel application.
Choice of the scale weights
As a default choice we recommend equal weights. This choice fulfils (together with many other choices) the conditions of Theorems 3.7 and 3.8. Unlike for the significance level α, only the bound for the underestimation of the number of changepoints depends on these weights. Note that this gives the user the possibility to incorporate prior information on the scales without violating the overestimation control of Theorems 3.3 and 3.4: if, for instance, changes are expected to occur only on small segments, then the detection power on these scales can be increased by choosing the first weights large and the other ones small (or even zero). In contrast, if the overall signal-to-noise ratio is expected to be very small, then it is nearly impossible to detect changes on small scales, and larger scales should be weighted more in order to detect at least the changes on these scales. The quantitative influence of the weights on the detection power can be seen in the underestimation bound of Theorem C.5 in Supplement C, which is a refinement of Theorem 3.5. We also investigate such choices quantitatively in simulations in Section 4.
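As an illustration of weighting scales, the following sketch builds a dyadic scale system and a weight vector that emphasizes the smallest scales. The concrete numbers (raw weights 2.0 vs. 0.5, three emphasized scales) are arbitrary choices for illustration, not recommendations from the text:

```python
import math

def dyadic_scales(n):
    """Dyadic candidate scales 1, 2, 4, ..., up to n."""
    return [2 ** l for l in range(int(math.log2(n)) + 1)]

def small_scale_weights(n, emphasis=3):
    """Example weight vector putting most mass on the smallest scales.

    Emphasizing small scales raises detection power for changes on
    short segments, at the price of power on the remaining scales.
    """
    scales = dyadic_scales(n)
    raw = [2.0 if i < emphasis else 0.5 for i in range(len(scales))]
    total = sum(raw)
    return [w / total for w in raw]  # normalized to sum to one
```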
4. Simulations
In this section we compare H-SMUCE (R-package available at http://www.stochastik.math.uni-goettingen.de/hsmuce, v. 0.0.0.9000, 2015-04-15) in simulations with circular binary segmentation (CBS) (Venkatraman and Olshen, 2007), cumSeg (Muggeo and Adelfio, 2011) and a cross-validation method (CV) (Arlot and Celisse, 2011), as they are also designed to be robust against heterogeneous noise. Moreover, we include SMUCE (Frick et al., 2014) in simulations with a constant variance as a benchmark, to examine how much the detection power of H-SMUCE decreases in this case, which may be regarded as the price for adaptation to heterogeneous noise. We fix the weights and vary the significance level α. A simulation with tuned weights can be found in Section B.2 in the Supplement. For CBS we call the function segmentByCBS (http://cran.r-project.org/web/packages/PSCBS/, v. 0.40.4, 2014-02-04) with the standard parameters. For the cross-validation method we use the Matlab function proc_LOOVF (http://www.di.ens.fr/~arlot/code/CHPTCV.htm, v. 1.0, 2010-10-27) with the parameter choice of the demo file. For cumSeg we call the method jumpoints (http://cran.r-project.org/web/packages/cumSeg/, v. 1.1, 2011-10-14) with the maximal number of changepoints chosen large enough such that the estimation is not influenced by this choice. For SMUCE we call the function smuceR (http://cran.r-project.org/web/packages/stepR, v. 1.0-3, 2015-06-18) with the standard parameters; in particular the system of all intervals is used.
To avoid specific interactions between the signal and the dyadic partition, we generate in each repetition a random pair of signal and standard deviation function as follows (all random variables are independent of each other).
(a)
We fix the number of observations, the number of changepoints, a constant and a minimum value for the smallest scale.
(b)
We draw the locations of the changepoints uniformly distributed, with the restriction that each segment is at least as long as the minimal scale.
(c)
We choose the function values of the standard deviation as independent draws from a uniform distribution.
(d)
We determine the function values of the signal such that
(4.1) Thereby, we start from a fixed initial value and choose randomly with equal probability whether the expectation increases or decreases at each changepoint.
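The sampling scheme (a)-(d) can be sketched as follows. Since the exact form of (4.1) is not reproduced here, the jump size `c * sd / sqrt(segment length)` only encodes the intended property that every change is similarly hard to detect, in the spirit of the detection boundary of Section 3.3; the uniform interval [0.5, 2] for the standard deviations, the starting value 0 and all default parameters are likewise assumptions for illustration:

```python
import math
import random

def simulate_signal(n=500, K=5, c=2.0, min_scale=10, seed=1):
    """Sketch of steps (a)-(d): random changepoint locations with a
    minimal segment length, uniform standard deviations per segment,
    and mean jumps scaled so each change is similarly hard to detect."""
    rng = random.Random(seed)
    # (b) changepoint locations, redrawn until every segment is long enough
    while True:
        cps = sorted(rng.sample(range(1, n), K))
        segs = [b - a for a, b in zip([0] + cps, cps + [n])]
        if min(segs) >= min_scale:
            break
    # (c) standard deviations, uniform on an assumed interval [0.5, 2]
    sds = [rng.uniform(0.5, 2.0) for _ in range(K + 1)]
    # (d) mean values: start at 0, jump up or down with equal probability
    means = [0.0]
    for i in range(K):
        jump = c * sds[i + 1] / math.sqrt(segs[i + 1])
        means.append(means[-1] + (jump if rng.random() < 0.5 else -jump))
    return cps, segs, means, sds
```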
By (4.1) we provide a situation where all changepoints are similarly hard to find, recall the minimax detection boundary from Section 3.3. An example is displayed in Figure 2 in the introduction, where H-SMUCE misses one changepoint at the smallest displayed significance level and detects the correct number of changepoints for larger levels. In Figure 9 (Supplement B) we see that CBS (Venkatraman and Olshen, 2007) also finds all changepoints, but detects further changes. The performance of cumSeg (Muggeo and Adelfio, 2011) and CV (Arlot and Celisse, 2011) is worse: both miss several changes, and a false positive is added as well. We now examine these methods more extensively. Each simulation setting is repeated a large number of times.
In the following we report the difference between the estimated and the true number of changepoints, as well as the mean of the absolute value of this difference. Additionally, to rate the estimation of the locations of the changepoints, we use the false positive sensitive location error (FPSLE), which compares, for each estimated segment, its boundaries with the true changepoints neighbouring the segment's midpoint, and the false negative sensitive location error (FNSLE), defined analogously with the roles of true and estimated changepoints interchanged, see (Futschik et al., 2014, Section 3.1). We also show the mean integrated squared (absolute) error, MISE (MIAE), for all methods.
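As a rough illustration of what these error measures penalize, the following simplified variants average, for each changepoint of one set, the distance to the nearest changepoint of the other set; the exact segment-midpoint-based definitions of Futschik et al. (2014) differ in detail:

```python
def mean_nearest_dist(from_pts, to_pts):
    """Average distance from each point in from_pts to the nearest point
    in to_pts (simplified stand-in for the segment-based definitions)."""
    if not from_pts:
        return 0.0
    if not to_pts:
        return float("inf")
    return sum(min(abs(p - q) for q in to_pts) for p in from_pts) / len(from_pts)

def fpsle_like(est, true):
    # grows when spurious (false positive) changepoints are present
    return mean_nearest_dist(est, true)

def fnsle_like(est, true):
    # grows when true changepoints are missed (false negatives)
    return mean_nearest_dist(true, est)

def mise(est_vals, true_vals):
    """Mean integrated squared error of two functions on a common grid."""
    return sum((a - b) ** 2 for a, b in zip(est_vals, true_vals)) / len(true_vals)
```

In this simplified form the FPSLE-like error ignores segment midpoints; it is meant only to convey which kind of misestimation each measure is sensitive to.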
4.1. Simulation results
In this section we discuss the results of the simulations for the models (1.2) and (1.3). We start in Table 1 (Supplement B) with the simple setting of a single change at the midpoint, where we vary the variances on the adjoining segments. In Table 2 (Supplement B) we display results for a constant variance and in Table 3 (Supplement B) for heterogeneous errors. We excluded CV from the simulations with larger sample sizes due to its large computation time, cf. the run time simulations in Section A.3 in the supplement.
All simulations confirm the overestimation control for H-SMUCE from Theorem 3.3 and the exponential decay of the overestimation probability in Theorem 3.4. The simulations with a single changepoint confirm that the size of the variance change has no influence; rather, the size of the variances matters. We found that H-SMUCE performs well compared to all other methods. A small α avoids overestimation, but risks missing changes that are harder to detect. Thus, comparing the estimates for different α shows, in accordance with our theory, that it is reasonable to relax α if changes are expected to be harder to detect (recall the discussion in Section 3.4). Among the other methods, CBS performs best in both the easier and the difficult scenarios, whereby it in particular shows a tendency to overestimate the number of changepoints.
For a constant signal (corresponding to the setting without changepoints in Table 2) CBS overestimates the number of changepoints even slightly less than H-SMUCE, whereas cumSeg and CV overestimate hardly ever. In the case of a constant variance we found that the detection power of H-SMUCE is only slightly worse than that of SMUCE, although SMUCE uses the system of all intervals instead of the dyadic partition. The difference is larger for smaller sample sizes, in which case CBS and cumSeg also perform better than H-SMUCE, since the detection power of H-SMUCE depends strongly on the lengths of the constant segments. Moreover, the sample size plays a similar role as the number of changepoints, since the average length of the constant segments decreases if the sample size decreases or the number of changepoints increases. Worse results for smaller segment lengths are due to the familywise error control of H-SMUCE, as it guarantees a strict control of overestimating the number of changepoints.
Similar results can be observed in the simulations with heterogeneous errors. H-SMUCE performs better than cumSeg and CV, and in the single changepoint setting in particular better than CBS. CBS outperforms H-SMUCE in some of the harder settings, although H-SMUCE has a much smaller tendency to overestimate the number of changepoints, whereas in particular CBS and CV tend to overestimation. This can also be seen from the MISE and the MIAE, as these measures are much more affected by underestimation than by overestimation. These findings are also supported by the FPSLE and the FNSLE: the FPSLE is heavily affected by overestimation, whereas the FNSLE is larger in case of underestimation.
In all simulations with heterogeneous errors and the larger number of observations H-SMUCE outperforms the other methods, and this becomes even more pronounced as the number of observations grows. In comparison to the simulations with fewer observations, the tendency of CBS to overestimate the number of changepoints then also becomes more prominent. Finally, in further simulations (not displayed) we found that the detection power of all methods decreases for smaller values of the constant in step (d), but all results remain qualitatively the same. All in all, we found that H-SMUCE performs well as the sample size becomes larger, in particular if the constant segments are not too short, as indicated by assumption (3.11) in Theorem 3.10.
A comparison of Tables 3 and 4 (Supplement B) shows that tuned weights increase the detection power of H-SMUCE at all significance levels, so we encourage the user to adapt the weights if prior information on the scales on which changes occur is available. Details on how the weights are chosen can be found in Section B.2 in the supplement.
4.2. Robustness against model violations
We begin by investigating how robust the methods are against a violation of the assumption that the standard deviation changes only at the locations where the mean changes. We consider continuous changes as well as abrupt changes. The exact functions for the standard deviation are shown in Figure 10 (Supplement B). In Table 5 (Supplement B) we see that H-SMUCE and CBS are very robust against heterogeneous noise on the constant segments, whereas, remarkably, the detection power of CV is even improved. Moreover, additional simulations (not displayed) with fewer observations show the same robustness.
Moreover, we examine robustness against small periodic trends in the mean in simulations similar to those in (Venkatraman et al., 2004), also adapted to the inhomogeneous variance. The exact simulation setting can be found in Section B.3 (Supplement B). We see from Table 6 that H-SMUCE shows similar results for small trends as in the simulation without trend, but is affected by larger trends, in particular if these are not scaled by the standard deviation. CBS overestimates heavily in all cases, whereas cumSeg (although not affected by the trend) shows over- and underestimation.
Furthermore, we investigate robustness against heavy tails of the error distribution. In Table 7 (Supplement B) we consider t-distributed errors which are scaled such that the expectation and the standard deviation are the same as in Section 4.1. As expected, cumSeg is not robust against heavy tails (as it misinterprets extreme values as changes in the signal), whereas H-SMUCE provides reasonable results. In comparison to gaussian errors, H-SMUCE is not influenced in the single changepoint setting, underestimation is more distinct in the constant variance scenario, and the detection power is even increased in the scenario with heterogeneous errors. In comparison, CBS is not influenced in the single changepoint setting either, under- and overestimates in the constant variance scenario, and is slightly worse with a tendency to underestimation in the scenario with heterogeneous errors, whereas CV overestimates rarely, but heavily, in the single changepoint setting, under- and overestimates in the constant variance scenario and is robust in the last scenario.
In summary, H-SMUCE appears to be robust against a wide range of variance changes on constant segments and to be only slightly affected by tails heavier than gaussian; in particular, no tendency to overestimation was visible in our simulations. This may be explained by the fact that the local likelihood ratio tests of H-SMUCE are quite robust against heterogeneous noise, see for instance (Bakirov and Szekely, 2006; Ibragimov and Müller, 2010), and against non-normal errors, see (Lehmann and Romano, 2005) and the references therein. Unlike the number of changepoints, the locations are sometimes misestimated, since the restricted maximum likelihood estimator is influenced by changes of the variance. Instead, more robust estimators, for instance local median and MAD estimators, could be used.
5. Application to ion channel recordings
In this section we apply H-SMUCE to current recordings of a porin in planar lipid bilayers performed in the Steinem lab (Institute of Organic and Biomolecular Chemistry, University of Göttingen). Porins are β-barrel proteins present in the outer membrane of bacteria and in the outer mitochondrial membrane of eukaryotes (Benz, 1994; Schirmer, 1998). Due to their large pore diameter they enable passive diffusion of small solutes like ions or sugars. The partial blockade of the pore by an internal loop results in gating that can be detected using the voltage clamp technique (Sakmann and Neher, 1995). We aim to detect the gating automatically, since in many ion channel applications hundreds of datasets, each with several hundred thousand data points, have to be analysed. For noise reduction the data were automatically preprocessed in the amplifier with an analogue four-pole Bessel low-pass filter of 1 kHz. Hence, the noise is coloured, but the correlation is small if the trace is subsampled by a factor of eleven or more, see (Hotz et al., 2013, (6)), which has been done in the following. Finally, we apply H-SMUCE to the subsampled observations.
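The subsampling step described above amounts to keeping every eleventh observation; a minimal sketch (the factor 11 follows the rule of thumb quoted from Hotz et al., 2013; the correlation threshold itself is not reproduced here):

```python
def subsample(trace, factor=11):
    """Keep every factor-th observation of a recorded trace so that the
    serial correlation left by the analogue low-pass filter becomes small."""
    return trace[::factor]
```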
In Figure 3 we see that the signal fluctuates around two or more levels, the so-called open (higher conductivity, larger current measurements) and closed (lower conductivity, smaller current measurements) states. Moreover, the variance in the open states is larger than in the closed state, a well known phenomenon denoted as open channel noise (Sakmann and Neher (1995, Section 3.4.4) and the references therein), which arises for larger ion channels such as porins from conformational fluctuations in the channel protein (Sigworth, 1985). Due to the pronounced heterogeneity in the variance, methods which assume a constant variance fail to reconstruct the gating, see Figure 1 in the introduction for an illustration. In contrast, H-SMUCE provides a reasonable fit that covers the main features of the data. Additional smaller changes are found by CBS, cumSeg and CV, see Figure 3. These changes might be explained by some uncontrollable baseline fluctuations caused, for instance, by small holes in the membrane due to movements of the lipids. On the other hand, we found in the simulations, see Table 3 and the example in Figure 9 (both Supplement B), that CBS and CV tend to include small artificial changes, whereas we saw in Table 6 (Supplement B) that H-SMUCE is quite robust against small periodic trends in the signal. For illustrative purposes, in order to examine these changes further, we increase in Figure 3 the significance level and detect for instance changes around 33.0s and 33.2s, too. Taking also the confidence regions of