A linear time method for the detection of point and collective anomalies

Alexander Fisch, Idris Eckley, and Paul Fearnhead
Lancaster University, United Kingdom

The challenge of efficiently identifying anomalies in data sequences is an important statistical problem that now arises in many applications. Whilst there has been substantial work aimed at making statistical analyses robust to outliers, or point anomalies, there has been much less work on detecting anomalous segments, or collective anomalies. By bringing together ideas from changepoint detection and robust statistics, we introduce Collective And Point Anomalies (CAPA), a computationally efficient approach that is suitable when collective anomalies are characterised by either a change in mean, variance, or both, and distinguishes them from point anomalies. Theoretical results establish the consistency of CAPA at detecting collective anomalies and empirical results show that CAPA has close to linear computational cost as well as being more accurate at detecting and locating collective anomalies than other approaches. We demonstrate the utility of CAPA through its ability to detect exoplanets from light curve data from the Kepler telescope.

1 Introduction

Anomaly detection is an area of considerable importance for many time series applications, such as fault detection or fraud prevention, and has been the subject of increasing attention in recent years. See Chandola et al. (2009) and Pimentel et al. (2014) for comprehensive reviews of the area. As Chandola et al. (2009) highlight, anomalies can fall into one of three categories: global anomalies, contextual anomalies, or collective anomalies. Global anomalies and contextual anomalies are defined as single observations which are outliers with regards to the complete dataset and their local context respectively. Conversely, collective anomalies are defined as sequences of observations which are not anomalous when considered individually, but together form an anomalous pattern.

A number of different approaches can be taken to detect point (i.e. contextual and/or global) anomalies. These are observations which do not conform with the pattern of the data. Hence, the problem of detecting point anomalies can be reformulated as inferring the general pattern of the data in a manner that is robust to anomalies. The field of robust statistics offers a wide range of methods aimed at this problem. For instance, Rousseeuw and Yohai (1984) proposed estimators to robustly estimate the mean and variance, which were extended to a multivariate setting by Rousseeuw (1985). A wide variety of robust time series models also exist. For example, Muler et al. (2009) proposed a robust ARMA model, Muler and Yohai (2002) a robust ARCH model, and Muler and Yohai (2008) a robust GARCH model. A robust non-parametric method, which decomposes time series into trend, seasonal component, and residual was proposed by Cleveland et al. (1990).

The machine learning community has also provided a rich corpus of work for the detection of point anomalies. Commonly used methods include nearest neighbour based approaches, such as the local outlier factor (Breunig et al. (2000)), and information theoretical methods such as the one introduced by Guha et al. (2016). It is beyond the scope of this paper to review them all. Instead, we refer the reader to the excellent reviews of Chandola et al. (2009) and Pimentel et al. (2014).

One common drawback of several point anomaly approaches is their inability to detect anomalous segments, or collective anomalies. Such features are of significance in many applications. One example is the analysis of brain imaging data, where periods in which the brain activity deviates from the pattern of the rest state have been associated with sudden shocks (Aston and Kirch (2012)). Another example is in detecting regions of the genome with unusual copy number (Bardwell and Fearnhead (2017); Siegmund et al. (2011); Zhang et al. (2010)), with such copy number variation being associated with diseases such as cancer (Jeng et al. (2012)).

The main contribution of this paper is to use an epidemic changepoint model as a principled framework for both collective and point anomalies. An epidemic changepoint model assumes that the data follow a certain typical distribution at all times, except during some anomalous time windows. The behaviour changes away from the typical distribution at the beginning of these windows and returns to it at the end. These epidemic changes can naturally be interpreted as collective anomalies. For the case in which collective anomalies are characterised by epidemic changes in mean and/or variance, point anomalies can additionally be modelled as epidemic changes of length one in variance only. This framework thus allows for the joint modelling and detection of collective anomalies and point anomalies. We therefore call the algorithm Collective And Point Anomalies (CAPA).
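To make this framework concrete, the following sketch (our own illustration, not code from the paper) simulates data with typical distribution N(0, 1), one collective anomaly during which both mean and variance change, and one point anomaly modelled as a variance change of length one. All names and parameter values are assumptions chosen for illustration.

```python
import numpy as np

def simulate_epidemic(n=500, seed=1):
    """Simulate Gaussian data that follows the typical distribution N(0, 1)
    everywhere except in one anomalous window (a collective anomaly, with a
    new mean and variance) and at one point anomaly (a variance change of
    length one)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n)            # typical distribution
    s, e = 200, 250                        # anomalous window: observations s..e-1
    x[s:e] = rng.normal(3.0, 2.0, e - s)   # collective anomaly
    x[400] = rng.normal(0.0, 10.0)         # point anomaly
    return x, (s, e), 400

x, window, outlier = simulate_epidemic()
```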

As a motivation for our work, and to help make the ideas in this paper concrete, consider the problem of detecting exoplanets via the so-called transit method, first proposed by Struve (1952). The luminosity of a star is measured at regular intervals, with the aim of detecting periodically recurring segments of reduced luminosity. Such periods indicate the transit of a planet (Sartoretti and Schneider (1999)) and can naturally be interpreted as collective anomalies. The light curves are typically preprocessed (Mullally (2016)) and both the raw and whitened light curves can be accessed online. We have included the whitened light curve of the star Kepler 1132 in Figure 1 to illustrate the nature of this type of data. We note the presence of a global anomaly on day 1550 and the noisy nature of the data, making the detection of transits challenging given the weak signal they induce. Indeed, even the transit of Jupiter past the sun reduces the latter’s luminosity by only 1% (Sartoretti and Schneider (1999)).

(a) Full data
(b) Subset (120 days)
Figure 1: Light curve of Kepler 1132, obtained at approximately 30 minute intervals. Missing values are due to periods in which the star was not observed. Note the presence of a point anomaly on day 1550 and the fact that no transit signature is apparent to the eye either in the full data or in the zoom, despite the known presence of Kepler 1132-b, an exoplanet orbiting this star.

Existing work on the detection of collective anomalies can be found in the statistics and machine learning literature. On the statistical side, hidden Markov models have been proposed, which assume that a hidden state obeying a Markov chain determines whether the data produced is anomalous or typical (Smyth (1994)). The underlying assumption that anomalous segments share one or multiple common behaviours is very attractive for the exoplanet detection application outlined above, but can be a constraint in others. Hidden Markov models also suffer from the fact that they are not robust to global anomalies which are present in the Kepler data, as can be seen in Figure 1. Moreover, they tend to be slow to fit, which is an important disadvantage in many modern, big-data applications. For example, there are currently 40 million light curves, similar to that shown in Figure 1, that have been gathered and need analysing. Conversely, machine learning methods include LinkedIn’s luminol (Maheshwari et al. (2014)) which uses a sign test to detect segments of anomalous mean. However, as we will see in the simulation section of this paper, this method’s performance can be poor.

Epidemic changepoints can be modelled as two classical changepoints, for the detection of which a variety of methods have been proposed (Fearnhead and Rigaill (2017), Fryzlewicz (2014), James et al. (2016), Killick et al. (2012), Ma and Yau (2016)). However, this approach does not exploit the fact that the behaviour of the segment before the start and after the end of an epidemic segment is the same, which reduces its statistical power. This is a disadvantage, especially when faced with a weak signal, like in the light curve data.

The epidemic changepoint problem as such was first considered by Levin and Kline (1985), who use a cumulative sum type statistic to detect them (Yao, 1993). The main corpus of work addressing the problem of their detection has since been driven by the analysis of neuroimaging and genome data. An early application of epidemic changepoints to neuroimaging data can be found in Robinson et al. (2010), who use a hidden Markov model to detect epidemic changes in mean. This was later extended by Aston and Kirch (2012). Both methods are vulnerable to point anomalies, a shortcoming in some applications like the one we consider in this paper. Another limitation is that both approaches assume the presence of at most one change. Conversely, motivated by challenges arising in genomics, a range of methods, both univariate and multivariate, have been proposed to detect epidemic changes in mean, mainly by considering sum of squares type test statistics (see Jeng et al. (2012); Siegmund et al. (2011); Zhang et al. (2010)), sometimes in combination with hidden states. They are therefore vulnerable to global anomalies. A more general Bayesian hidden state method for the detection of anomalous segments was proposed by Bardwell and Fearnhead (2017).

The article is organised as follows: We begin by introducing a parametric model with epidemic changes in Section 2. This provides a general framework for collective anomalies, the location of which we infer by minimising a penalised cost. Motivated by our application, we will place a special emphasis on the detection of joint epidemic changes in mean and variance and show that epidemic changes of length one in variance only can be incorporated to model point anomalies.

In the classical changepoint setting, information is not typically shared between different segments of the data. However, in this epidemic setting, the typical parameter is shared, making it impossible to minimise the penalised cost via the dynamic programming approach of Jackson et al. (2005). We therefore provide an algorithm in Section 3 which minimises an approximation to the penalised cost based on a robust estimate of the parameter of the typical distribution. This approximation can be minimised by a dynamic program, which can be pruned in a fashion similar to Killick et al. (2012). As a result of this pruning, we find that the run time of CAPA is close to linear in the number of observations in some cases.

We present theoretical results regarding the consistency of CAPA at detecting collective anomalies in Section 4. Specifically, we introduce a proof of consistency for the detection of joint classical changes in mean and variance using a penalised cost approach, which is of independent interest. We then compare CAPA to other methods in a simulation study in Section 5 and show that it outperforms them, especially in the presence of point anomalies. We conclude the paper by demonstrating in Section 6 that CAPA can be used to detect Kepler 1132-b, an exoplanet which orbits Kepler 1132 (Morton et al. (2016)). The proofs of the theoretical results are all given in the appendix and the supplementary material can be found at the end of the paper. Code implementing CAPA can be accessed at https://github.com/Fisch-Alex/anomaly.

2 A Modelling Framework for Collective Anomalies

We assume that the data follow a parametric model and model collective anomalies as epidemic changes in the model parameters. Whilst, in practice, it is unlikely that the distribution of the data in an anomalous segment will belong to the same family of distributions as the distribution of the typical data, it can nevertheless be expected that a set of parameters different from the typical distribution’s will offer a better fit. We say that data $x_1, \dots, x_n$ follow a parametric epidemic changepoint model if

$$x_t \sim \begin{cases} f(\theta_i) & \text{if } s_i < t \leq e_i \text{ for some } 1 \leq i \leq K, \\ f(\theta_0) & \text{otherwise,} \end{cases}$$

where $\theta_0$ is the usually unknown parameter of the typical distribution, from which the model deviates during the anomalous segments $(s_1, e_1], \dots, (s_K, e_K]$. We assume these windows do not overlap, i.e. that $e_i < s_{i+1}$ for all $i$. Note that fitting an epidemic changepoint requires only one new set of parameters, $\theta_i$, for the $i$th anomalous window, since the typical parameter $\theta_0$ is shared across the non-anomalous segments. This compares favourably with the two additional sets of parameters introduced when an epidemic changepoint is fitted using two classical changepoints. We therefore gain statistical power. This gain is particularly important when $\theta$ is high dimensional.

It is possible to infer the number and location of epidemic changes by choosing $K$, $(s_1, e_1), \dots, (s_K, e_K)$, and $\theta_0, \theta_1, \dots, \theta_K$, which minimise the penalised cost

$$\sum_{t \notin \bigcup_i (s_i, e_i]} \mathcal{C}(x_t, \theta_0) + \sum_{i=1}^{K} \left[ \sum_{t=s_i+1}^{e_i} \mathcal{C}(x_t, \theta_i) + \beta \right], \qquad (1)$$

subject to $e_i - s_i \geq l$, where $l$ is the minimum segment length, for an appropriate cost function $\mathcal{C}(\cdot, \cdot)$ and a suitable penalty $\beta$. For example, $\mathcal{C}(x, \theta)$ could be defined as the negative log-likelihood of $x$ under the parametric model using parameter $\theta$. The penalty $\beta$ could then be set to a multiple of $\log(n)$ and this would be a BIC type penalty.

Using the formulation in (1), we can infer the location of joint epidemic changes in mean and variance by minimising the penalised cost related to the negative log-likelihood of Gaussian data. In this case $\theta = (\mu, \sigma^2)$ contains both the mean and variance and we minimise

$$\sum_{t \notin \bigcup_i (s_i, e_i]} \left[ \log(\sigma_0^2) + \frac{(x_t - \mu_0)^2}{\sigma_0^2} \right] + \sum_{i=1}^{K} \left[ \sum_{t=s_i+1}^{e_i} \left( \log(\sigma_i^2) + \frac{(x_t - \mu_i)^2}{\sigma_i^2} \right) + \beta \right], \qquad (2)$$

using a minimum segment length of 2 to account for the fact that $\theta$ is two dimensional.
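For a given anomalous segment, the inner minimisation of this Gaussian cost over the segment's mean and variance has a closed form: it is attained at the sample mean and the maximum likelihood variance, and the minimised cost of a segment of length $k$ is $k(\log \hat{\sigma}^2 + 1)$. A short sketch (our illustration, not the paper's implementation):

```python
import numpy as np

def gaussian_segment_cost(x):
    """Minimised Gaussian segment cost: the minimum over (mu, sigma^2) of
    sum_t [log(sigma^2) + (x_t - mu)^2 / sigma^2], attained at the sample
    mean and the maximum likelihood variance (dividing by the length)."""
    x = np.asarray(x, dtype=float)
    sigma2 = x.var()                       # MLE variance, ddof = 0
    return len(x) * (np.log(sigma2) + 1.0)

seg = np.array([0.3, -1.2, 0.8, 2.1, -0.5])
```

A brute-force search over a grid of (mean, variance) values recovers the same minimum; having this closed form is what makes an O(1) segment cost (via prefix sums of the data and their squares) possible in the dynamic programme of Section 3.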

It is well known that many changepoint detection methods struggle in the presence of point anomalies in the data and tend to fit two changepoints around each of them (Fearnhead and Rigaill (2017)). An approach based on minimising the above cost function is not intrinsically immune to it. However, we can modify the model by allowing epidemic changes, in variance only, of length one to address this issue. We therefore choose $K$, $(s_1, e_1), \dots, (s_K, e_K)$, $\theta_0, \theta_1, \dots, \theta_K$, as well as the set of point anomalies $O$, which minimise the modified penalised cost

$$\sum_{t \notin \bigcup_i (s_i, e_i] \cup O} \left[ \log(\sigma_0^2) + \frac{(x_t - \mu_0)^2}{\sigma_0^2} \right] + \sum_{i=1}^{K} \left[ \sum_{t=s_i+1}^{e_i} \left( \log(\sigma_i^2) + \frac{(x_t - \mu_i)^2}{\sigma_i^2} \right) + \beta \right] + \sum_{t \in O} \left[ \log(\sigma_t^2) + \frac{(x_t - \mu_0)^2}{\sigma_t^2} + \beta' \right],$$

where $\beta'$ is a penalty smaller than $\beta$. This modification ensures that it is now cheaper to fit an outlier as an epidemic changepoint in variance only than as a full epidemic change. Consequently, the method becomes robust against point anomalies, fitting epidemic changes only around true collective anomalies. This modification has the added benefit that it allows the algorithm to detect and distinguish between point and collective anomalies. This property is important for a range of applications in which collective and point anomalies have different interpretations (see Section 6 for an example).
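For a single point anomaly the minimisation is over the variance only and is available in closed form: minimising $\log(\sigma^2) + (x_t - \mu_0)^2/\sigma^2$ over $\sigma^2$ gives $\log((x_t - \mu_0)^2) + 1$, attained at $\sigma^2 = (x_t - \mu_0)^2$. A small sketch (our illustration):

```python
import numpy as np

def point_anomaly_cost(x_t, mu0):
    """Minimised cost of modelling a single observation as a variance-only
    epidemic change of length one: min over sigma^2 of
    log(sigma^2) + (x_t - mu0)^2 / sigma^2 = log((x_t - mu0)^2) + 1,
    the minimum being attained at sigma^2 = (x_t - mu0)^2."""
    return np.log((x_t - mu0) ** 2) + 1.0
```

Comparing this cost plus the smaller penalty with the cost of opening a full collective anomaly plus the larger penalty makes precise why isolated outliers are steered towards the point anomaly option.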

3 Inference for Collective Anomalies

Input: A set of observations of the form $x_1, \dots, x_n$, where $x_t \in \mathbb{R}$.
Penalty constants $\beta$ and $\beta'$ for the introduction of a collective and a point anomaly respectively
A minimum segment length $l$
Initialise: Set $C(0) = 0$, $S(0) = \emptyset$.
1: Obtain robust estimates $\hat{\mu}_0$ and $\hat{\sigma}_0$ of the mean and variance of the typical distribution
2: for $t = 1, \dots, n$ do
3:       Centralise the data: $z_t = (x_t - \hat{\mu}_0)/\hat{\sigma}_0$
4: end for
5: for $m = 1, \dots, n$ do
6:       $c_A(m) = \min_{0 \leq j \leq m-l} \left[ C(j) + \min_{\mu,\sigma^2} \sum_{t=j+1}^{m} \left( \log(\sigma^2) + (z_t - \mu)^2/\sigma^2 \right) \right] + \beta$ ▷ Collective Anom.
7:       $c_N(m) = C(m-1) + z_m^2$ ▷ No Anomaly
8:       $c_P(m) = C(m-1) + \log(\gamma + z_m^2) + 1 + \beta'$ ▷ Point Anomaly
9:       $C(m) = \min\left( c_A(m), c_N(m), c_P(m) \right)$
10:      switch do ▷ Select type of anomaly giving the lowest cost
11:            case $c_A(m)$: set $S(m) = S(j^*) \cup \{(j^*, m]\}$ for the minimising $j^*$
12:            case $c_N(m)$: set $S(m) = S(m-1)$
13:            case $c_P(m)$: set $S(m) = S(m-1) \cup \{m\}$
14: end for

Output The points and segments recorded in $S(n)$

Algorithm 1 CAPA Algorithm (No Pruning)

We now turn to consider the problem of minimising the penalised cost we introduced in the previous section. Unlike in the classical changepoint problem considered by Jackson et al. (2005), the penalised cost given by equation (1) can not be minimised using a dynamic program, since the parameter $\theta_0$ is shared across multiple segments and typically unknown. We therefore use robust statistics to obtain an estimate for $\theta_0$. Such robust estimates can be obtained for a variety of models (Hampel et al. (1986); Jurečková and Picek (2005)). For example, the median, M-estimators, or the clipped mean can be used to robustly estimate the mean. The interquartile range, the median absolute deviation, or the clipped standard deviation can be used to estimate the variance. Robust regression is available to estimate the parameters of AR models.
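As a concrete instance (the choice of estimator is ours; the text above leaves it open), the median and the consistency-scaled median absolute deviation give estimates of the typical mean and variance with a 50% breakdown point:

```python
import numpy as np

def robust_mean_var(x):
    """Estimate the typical mean and variance robustly: the median for the
    mean and the scaled median absolute deviation (MAD) for the standard
    deviation. The factor 1.4826 makes the MAD consistent for the standard
    deviation of Normal data."""
    x = np.asarray(x, dtype=float)
    mu0 = np.median(x)
    sigma0 = 1.4826 * np.median(np.abs(x - mu0))
    return mu0, sigma0 ** 2
```

Because both estimators have a high breakdown point, a modest fraction of anomalous observations barely moves them, which is what makes the plug-in approximation reasonable.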

Having obtained $\hat{\theta}_0$, we then minimise

$$\sum_{t \notin \bigcup_i (s_i, e_i]} \mathcal{C}(x_t, \hat{\theta}_0) + \sum_{i=1}^{K} \left[ \min_{\theta_i} \sum_{t=s_i+1}^{e_i} \mathcal{C}(x_t, \theta_i) + \beta \right]$$

as an approximation to (1). Since it can be expected that most of the data belongs to the typical distribution, $\hat{\theta}_0$ should be close to $\theta_0$. One might therefore expect that using this estimate will have little impact on the performance of the method, which we also show theoretically for joint changes in mean and variance in Section 4.2.

The approximation to the penalised cost can be minimised exactly by solving the dynamic programme

$$C(m) = \min\left( C(m-1) + \mathcal{C}(x_m, \hat{\theta}_0), \;\; \min_{0 \leq j \leq m-l} \left[ C(j) + \min_{\theta} \sum_{t=j+1}^{m} \mathcal{C}(x_t, \theta) + \beta \right] \right),$$

where $C(m)$ is the cost of the most efficient partition of the first $m$ observations and $C(0) = 0$. For example, solving the dynamic programme

$$C(m) = \min\left( C(m-1) + \log(\hat{\sigma}_0^2) + \frac{(x_m - \hat{\mu}_0)^2}{\hat{\sigma}_0^2}, \;\; \min_{0 \leq j \leq m-2} \left[ C(j) + \min_{\mu, \sigma^2} \sum_{t=j+1}^{m} \left( \log(\sigma^2) + \frac{(x_t - \mu)^2}{\sigma^2} \right) + \beta \right] \right)$$

approximately minimises the penalised cost for joint epidemic changes in mean and variance defined in equation (2). Similarly, we can minimise its point anomaly robust analogue by additionally allowing the option

$$C(m-1) + \log\left( \gamma + (x_m - \hat{\mu}_0)^2 \right) + 1 + \beta'$$

in the minimisation, where $\gamma$ is a small constant ensuring that the argument of the logarithm will be larger than 0 (see Algorithm 1 for pseudocode). Adding the term $\gamma$ is necessary when order statistics are used to obtain $\hat{\mu}_0$. Assuming that the observations are independent and Normal, all sums will be non-zero with probability 1, meaning that in theory such a correction is not necessary for the other logarithmic term. In practice, observations are of finite precision and adding $\gamma$ to the argument of the other logarithmic term, with $\gamma$ set to the level of rounding, should be considered.
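Putting the pieces together, here is a minimal un-pruned sketch of the dynamic programme for joint epidemic changes in mean and variance with the point anomaly option (our illustration; the function name, the default value of the small constant, and the choice of robust estimators are assumptions, not the paper's reference implementation):

```python
import numpy as np

def capa_no_pruning(x, beta, beta_p, min_len=2, gamma=1e-4):
    """Un-pruned O(n^2) dynamic programme. At each step m the three options
    are: x_m is typical, x_m is a point anomaly, or a collective anomaly
    (j, m] ends at m."""
    x = np.asarray(x, dtype=float)
    mu0 = np.median(x)                                # robust location
    sigma0 = 1.4826 * np.median(np.abs(x - mu0))      # robust scale (scaled MAD)
    z = (x - mu0) / sigma0                            # centralise the data
    n = len(z)
    s1 = np.concatenate(([0.0], np.cumsum(z)))        # prefix sums give O(1)
    s2 = np.concatenate(([0.0], np.cumsum(z ** 2)))   # segment costs
    C = np.full(n + 1, np.inf)
    C[0] = 0.0
    trace = [None] * (n + 1)                          # how the optimum at m arose
    for m in range(1, n + 1):
        # option 1: x_m belongs to the typical (standardised) distribution
        best, trace[m] = C[m - 1] + z[m - 1] ** 2, ("typical", m - 1)
        # option 2: x_m is a point anomaly (variance-only change of length one)
        c = C[m - 1] + np.log(gamma + z[m - 1] ** 2) + 1.0 + beta_p
        if c < best:
            best, trace[m] = c, ("point", m - 1)
        # option 3: a collective anomaly (j, m] of length >= min_len ends at m
        for j in range(0, m - min_len + 1):
            k = m - j
            mean = (s1[m] - s1[j]) / k
            var = (s2[m] - s2[j]) / k - mean ** 2     # MLE variance of segment
            c = C[j] + k * (np.log(gamma + var) + 1.0) + beta
            if c < best:
                best, trace[m] = c, ("collective", j)
        C[m] = best
    anomalies, m = [], n                              # backtrack through trace
    while m > 0:
        kind, j = trace[m]
        if kind != "typical":
            anomalies.append((kind, j, m))
        m = j
    return anomalies[::-1]
```

On data with a clear collective anomaly, this recovers the anomalous segment with boundaries at (or within a point or two of) the truth; the inner loop over candidate start points j is what the pruning discussed next removes.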

Solving the full dynamic programme is at least $\mathcal{O}(n^2)$. This lower bound can be achieved for the detection of joint changes in mean and variance. However, we can prune the solution space by borrowing ideas from Killick et al. (2012), provided the loss function is such that adding a free changepoint will not increase the cost, a property which holds for many commonly used cost functions such as the negative log-likelihood. Indeed, the following proposition holds:

Proposition 1

Let the cost function $\mathcal{C}(\cdot, \cdot)$ be such that

$$\min_{\theta} \sum_{t=j+1}^{m} \mathcal{C}(x_t, \theta) + \min_{\theta} \sum_{t=m+1}^{k} \mathcal{C}(x_t, \theta) \leq \min_{\theta} \sum_{t=j+1}^{k} \mathcal{C}(x_t, \theta)$$

holds for all $j$, $m$, and $k$ such that $j < m < k$. Then, if

$$C(j) + \min_{\theta} \sum_{t=j+1}^{m} \mathcal{C}(x_t, \theta) \geq C(m)$$

holds for some $m \geq j + l$, we can disregard $j$ for all future steps of the dynamic programme.

Proof: Please see the Appendix. Note that the time after which an option can be discarded also depends on the minimum segment length, something not considered by Killick et al. (2012).

This result enables us to reduce the computational cost. In practice, we found that the cost was close to linear for the detection of joint epidemic changes in mean and variance when the number of true epidemic changes increased linearly with the number of observations.
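A sketch of the pruning (our illustration, restricted to the collective anomaly option for brevity): a candidate start point j is recorded as dominated once C(j) plus the unpenalised segment cost over (j, m] can no longer beat C(m), and is discarded after a delay of one minimum segment length, reflecting the dependence on the minimum segment length noted above. The pruned and un-pruned programmes return the same optimal cost while evaluating fewer candidate segments.

```python
import numpy as np

def collective_cost_dp(z, beta, min_len=2, gamma=1e-4, prune=True):
    """Dynamic programme for collective anomalies on centralised data z.
    With prune=True, a start point j is dropped once the pruning test
    C(j) + seg_cost(j, m) >= C(m) has held for min_len steps."""
    n = len(z)
    s1 = np.concatenate(([0.0], np.cumsum(z)))
    s2 = np.concatenate(([0.0], np.cumsum(z ** 2)))

    def seg_cost(j, m):   # min over (mu, sigma^2) of the Gaussian segment cost
        k = m - j
        mean = (s1[m] - s1[j]) / k
        var = (s2[m] - s2[j]) / k - mean ** 2
        return k * (np.log(gamma + var) + 1.0)

    C = [0.0] * (n + 1)
    live, prune_at, evals = [], {}, 0
    for m in range(1, n + 1):
        if m >= min_len:
            live.append(m - min_len)       # j = m - min_len becomes admissible
        if prune:                          # discard dominated start points
            live = [j for j in live
                    if j not in prune_at or m < prune_at[j] + min_len]
        best = C[m - 1] + z[m - 1] ** 2    # typical point option
        for j in live:
            evals += 1
            best = min(best, C[j] + seg_cost(j, m) + beta)
        C[m] = best
        for j in live:                     # record when the pruning test holds
            if j not in prune_at and C[j] + seg_cost(j, m) >= C[m]:
                prune_at[j] = m
    return C[n], evals
```

Running both variants on the same data gives identical optimal costs, with the pruned variant evaluating fewer candidate segments; the saving grows with the number of collective anomalies, in line with the near linear runtimes reported in Section 5.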

4 Theory for Joint Changes in Mean and Variance

We now introduce some theoretical results regarding the consistency of CAPA. All proofs for the results in this section can be found in the appendix. We establish that the consistency of CAPA can be viewed as a corollary of the consistency of a statistical procedure minimising a penalised cost function to detect classical changepoints. Consequently, we will begin by proving that method’s consistency for the detection of changes in mean and variance in Section 4.1. To the best of our knowledge, no such result exists in the literature, which makes this proof of independent interest. We then proceed to proving the consistency of CAPA in Section 4.2.

4.1 Consistency of Classical Changepoint Detection

Consider the sequence which is normally distributed with changepoints. The sequence therefore satisfies , where

Here denote the start of the series, the changepoints, and the end of the series. Changes in mean and variance can be of varying strength. To quantify this, we define the signal strength of the change in variance at the th changepoint to be

We note that is equal to 0 if, and only if, . We also define the signal strength of change in mean at the th changepoint to be

Note that these two quantities can be combined into a global measure of signal strength for the th change (see Lemma 7 in the Appendix for details).

We now define the penalised cost of data under partition to be

for . Here is a strengthened SIC-style penalty (Fryzlewicz (2014)) for introducing an additional changepoint. We estimate changepoints with a cost of segment

similar to the one we use to infer the location of epidemic changes in mean and variance. Since this leaves two parameters to fit, we impose a minimum segment length of two for all partitions.

Assume that there exists some such that for all , which ensures that the changepoints are sufficiently spaced apart to allow for their detection. Then, the following consistency result holds for the inferred number and location of changepoints and inferred by minimising :

Theorem 1

Let follow the distribution specified above and the changes be such that for some . Then there exist constants decreasing in , decreasing in , and such that

holds for all .

Proof: Please see the Appendix.

4.2 Consistency of CAPA

The results we obtained in the previous section can be extended to prove the consistency of CAPA for the detection of joint epidemic changes in mean and variance. As in the previous section, consider data which is of the form , where . Since we now assume epidemic changes, we have

Here, and are the typical mean and variance respectively and is the number of epidemic changepoints. The variables and denote the starting and end point of the th anomalous window respectively. We impose and for some . Treating the and like classical changepoints allows us to extend the definitions of , , and to the epidemic changepoint model.

The following consistency result then holds for a partition inferred by CAPA using a minimum segment length of two and for some as penalty for both point anomalies and epidemic changepoints.

Theorem 2

Let follow the distribution specified above and the changes be such that for some . Then there exist constants decreasing in , decreasing in , and such that

holds for all .

Proof: Please see the Appendix.

5 Simulation Study

To assess the potential of CAPA, we compare its performance to that of other popular anomaly and changepoint methods on simulated data. In particular, we compare with PELT as implemented in Killick and Eckley (2014), a commonly used changepoint detection method, luminol (Maheshwari et al. (2014)), an algorithm developed by LinkedIn to detect segments of atypical behaviour, as well as BreakoutDetection (James et al. (2016)) which was introduced by Twitter to detect changes in mean in a way which is robust to point anomalies.

The simulation study was conducted over simulated time series each consisting of 5000 observations, for which the typical data follows a distribution. Epidemic changepoints start at a rate of 0.0005 (corresponding to an average of about 2.5 epidemic changes in each series), with their length being i.i.d.  distributed. In each anomalous segment the data is again normally distributed, with the means being i.i.d.  distributed and standard deviations i.i.d.  distributed. We used

  1. and for weak and strong changes in mean respectively

  2. and for weak and strong changes in variance respectively

We compared the performance of the four methods in the presence of both strong and weak changes in mean and/or variance. We also repeated the analysis with 10 i.i.d.  distributed point anomalies occurring at randomly sampled points in the typical data. The comparison of these methods is made using the three different approaches we detail below.

(a) No point anomalies
(b) No point anomalies
(c) Point anomalies present
(d) Point anomalies present
Figure 2: Data examples and ROC curves for weak changes in mean for CAPA (black), PELT (red), BreakoutDetection (green), and luminol (blue).
(a) No point anomalies
(b) No point anomalies
(c) Point anomalies present
(d) Point anomalies present
Figure 3: Data examples and ROC curves for strong changes in mean for CAPA (black), PELT (red), BreakoutDetection (green), and luminol (blue).

5.1 ROC

We obtained ROC curves for the four methods. For BreakoutDetection and PELT, we considered detected changes within 20 time points of true changes to be true positives and classified all other detected changes as false positives. For luminol and CAPA, we considered detected starting and end points of epidemic changes to be true positives if they were within 20 observations of a starting and end point respectively. The results regarding the precision of true positives in Section 5.2 suggest that the results in this section are robust with regard to the choice of error tolerance. We set the minimum segment length to ten for PELT, CAPA, and BreakoutDetection. To obtain the ROC curves we varied the penalty for epidemic segments in CAPA, the penalty in PELT, the threshold in luminol, and the beta parameter of BreakoutDetection.
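The matching of detections to true changes is not spelled out beyond the 20-point tolerance; one plausible sketch (our assumption, not necessarily the exact rule used) matches each detection greedily to the first unmatched true change within the tolerance:

```python
def match_changes(true_changes, detected, tol=20):
    """Score a set of detected changepoints against the truth: a detection
    within tol points of a still unmatched true change counts as a true
    positive; every other detection counts as a false positive."""
    unmatched = sorted(true_changes)
    tp = fp = 0
    for d in sorted(detected):
        hit = next((t for t in unmatched if abs(d - t) <= tol), None)
        if hit is None:
            fp += 1
        else:
            tp += 1
            unmatched.remove(hit)
    return tp, fp
```

Sweeping the relevant penalty or threshold of each method and recording the resulting (true positive, false positive) counts at each value traces out the ROC curves.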

The resulting ROC curves, as well as examples of realisations of the data for the scenario of weak and strong changes in mean can be found in Figures 2 and 3 respectively. The results for joint changes in mean and variance, as well as changes in variance can be found in the supplementary material. We see that CAPA generally outperforms PELT, even in the absence of point anomalies. This is due to it having more statistical power, by exploiting the epidemic nature of the change. This becomes particularly apparent when the changes are weak. CAPA also outperforms BreakoutDetection and luminol for epidemic changes in mean, the scenario for which these methods were developed. Moreover, the performance of CAPA is barely affected by the presence of point anomalies, unlike that of the non-robust methods. This observation remained true when we repeated our analysis with distributed point anomalies. The ROC curves for these additional simulations can be found in the supplementary material.

Mean     Variance   Point anomalies   CAPA   PELT   BreakoutDetection   luminol
weak     -          -                 1.79   1.50   3.40                 9.91
weak     -          10                1.72   2.27   3.75                10.70
strong   -          -                 0.16   0.61   5.38                15.99
strong   -          10                0.19   0.67   4.68                15.60
-        weak       -                 1.41   1.43   4.60                 9.87
-        weak       10                1.31   1.89   4.49                10.76
-        strong     -                 0.33   0.73   5.19                12.03
-        strong     10                0.33   0.79   5.17                11.29
weak     weak       -                 1.16   1.30   4.00                11.40
weak     weak       10                1.22   1.63   4.00                11.30
strong   strong     -                 0.09   0.56   3.78                16.31
strong   strong     10                0.09   0.58   3.77                15.71
Figure 4: Precision of true positives measured in mean absolute distance for CAPA, PELT, BreakoutDetection, and luminol.

5.2 Precision

We also investigated the precision of the true positives for the four methods. We compared the mean absolute distance between detected changes (i.e. true changes which had a detected change within 20 observations) and the nearest estimated change across all 12 scenarios. We used the default penalties for all methods (i.e. the default threshold for luminol and the BIC for PELT and CAPA) except BreakoutDetection, where we found that the default penalty returned no true positives at all for many scenarios. We therefore used the results we obtained when deriving the ROC curves to set the beta parameter to an appropriate level for each case.

The results of this analysis can be found in Figure 4. We see that CAPA is generally the most precise. Moreover, its precision is not too strongly affected by the presence of point anomalies, unlike that of PELT, whose performance deteriorates significantly in their presence, especially when the signal is weak. The reason for this is that PELT fits additional changes in the presence of anomalies, which results in shorter segments. This leads to less accurate parameter estimates, which results in poorer estimates for the location of the changepoint. CAPA does not face this problem since the parameter of the typical distribution is shared across all segments. This remains true when the point anomalies are a lot stronger, as can be seen in the supplementary material.

(a) With epidemic changes
(b) Stationary data
Figure 5: Runtime of CAPA (black), PELT (red), BreakoutDetection (green), and luminol (blue)

5.3 Runtime

Finally, we investigated the relationship between the runtime of the four methods and the number of observations. Our comparison is based on data following a distribution identical to the one we used in Sections 5.1 and 5.2. Since this type of data favours PELT and CAPA, because the expected number of changes increases with the number of observations, we also compared the runtime of the four methods on stationary data, which represents the worst case scenario for these methods.

Figure 5 displays the average runtime over 50 repetitions for the two cases. When comparing the slopes on a log-log scale between 10000 and 50000 datapoints, we note that the slope is very close to 2 for BreakoutDetection in both cases, as well as for CAPA and PELT on stationary data, suggesting quadratic scaling. In the presence of epidemic changes, however, that slope is 1.26 for CAPA (and 1.14 even between 25000 and 50000 datapoints), thus suggesting near linear runtime.

6 Application to Kepler Light Curve Data

(a) 62.8 days
(b) 62.9 days
(c) 63.0 days
Figure 6: CAPA applied to the light curve of Kepler 1132 preprocessed using different periods.
Figure 7: The strongest change in mean, as measured by , detected by CAPA for the light curve of Kepler 1132. All periods from 1 to 200 days at 0.01 day increments were examined.

We now apply CAPA to the Kepler light curve data, with the aim of detecting exoplanets via the so-called transit method (Sartoretti and Schneider (1999)). As described in Section 1, this approach consists of repeatedly measuring a star’s brightness for a certain period of time, thus obtaining a so-called light curve. Periodically recurring dips in the measurements then point towards the transit of a planet causing a small eclipse. Since the signal of transiting planets is known to be weak, we amplify it by exploiting its periodic nature. If the period of an orbiting planet were known, the signal of its transit could be strengthened by considering all data points to have been gathered at their measurement time modulo that period. We would thus obtain an irregularly sampled time series which we can transform into a regularly sampled time series by binning the data into equally sized bins of length approximately equal to the measurement interval of the Kepler telescope and taking the average within each bin. We could then apply CAPA to this preprocessed data, which would exhibit a stronger signal for any planet with the associated period. Detecting the signal for such a planet involves detecting a collective anomaly with a reduced mean. However, we need to do this whilst being robust to the point anomalies in the data, and to the other collective anomalies potentially associated with planets with different periods. The results obtained by applying this method, using the default penalties of our software implementation of CAPA, to the light curve of Kepler 1132 using a period of 62.8, 62.9, and 63.0 days can be found in Figure 6. We note that using a period of 62.9 days results in a promising dip, which is not present when using 62.8 or 63.0 days as the period.
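The folding-and-binning preprocessing described above can be sketched as follows (our illustration; the function name and the handling of empty bins are assumptions):

```python
import numpy as np

def phase_fold_and_bin(times, flux, period, bin_width):
    """Fold a light curve at a candidate period: map every measurement time
    to its value modulo the period, then average the flux within equally
    sized phase bins. A transit with the candidate period stacks up in the
    same few bins, strengthening its signal."""
    phase = np.asarray(times, dtype=float) % period
    n_bins = int(np.ceil(period / bin_width))
    idx = np.minimum((phase / bin_width).astype(int), n_bins - 1)
    sums = np.bincount(idx, weights=np.asarray(flux, dtype=float),
                       minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    binned = np.full(n_bins, np.nan)         # empty bins (unobserved phases)
    binned[counts > 0] = sums[counts > 0] / counts[counts > 0]
    return binned
```

Applying CAPA to the binned series then amounts to looking for a collective anomaly with reduced mean, with the smaller point anomaly penalty keeping residual outliers from masquerading as transits.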

Given a light curve, the periods of exoplanets orbiting the corresponding star (if any are present) are obviously not known a priori. We can, however, apply the above approach for a range of periods, given the fact that the cost of running CAPA is comparable to that of binning the data. Since transits appear as periods of reduced mean, we record the strength of the strongest change in mean as defined by and estimated using the sample mean and variance in the collective anomalies, and the estimated means and variance of the typical distribution. We expect this quantity to be largest for the periods of exoplanets. We identified the strength of the strongest change in mean for all periods from 1 day to 200 days with increments of 0.01 days for the light curve of Kepler 1132. The result of this analysis can be found in Figure 7. Note that the largest change in mean is recorded at a period of 62.89 days. As with spectral methods, we also observe resonance of the main signal at integer fractions of that period. This result is consistent with the existing literature, which considers Kepler 1132 to be orbited approximately every 62.892 days by the exoplanet Kepler 1132-b whose radius is about 2.5 times that of the earth (Morton et al. (2016)).

We also applied CAPA to the light curves of other stars with confirmed exoplanets and were able to detect their transit signal at the right period. A more detailed exposition of these results can be found in the supplementary material.

7 Acknowledgements

This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program. The authors would like to thank the Isaac Newton Institute for Mathematical Sciences for support and hospitality during the programme Statistical Scalability when work on this paper was undertaken. This work was supported by EPSRC grant numbers EP/K032208/1 (INI), EP/R014604/1 (INI), EP/N031938/1 (StatScale), and EP/L015692/1 (STOR-i). The authors also acknowledge British Telecommunications plc (BT) for financial support and David Yearling and Kjeld Jensen in BT Research & Innovation for discussions.

8 Appendix: Proofs

This Appendix contains proofs for all the results in this paper. Proofs of the lemmata we use can be found in the supplementary material.

8.1 Proof of Proposition 1

Let . We have

which shows that the cost of choosing will always be larger than that of choosing . We can thus disregard .

8.2 Proof of Theorem 1

Before proving this theorem, we introduce some notation. We define the cost of a segment under the true partition and true parameters to be

Note that this cost is additive, i.e. for we have , whilst the fitted cost satisfies the inequality .

We also define the residual sum of squares . Finally, we will work on the event sets , , , , , and which we define below using notation

where satisfies

Note that is guaranteed to exist by the intermediate value theorem. Indeed, the function is continuous and satisfies and as . The motivation for these events is as follows: bounds the error in the estimates of the mean, while , , and bound the error in the estimates of the variance. and are needed to prevent the existence of segments of length two and three respectively in which the observations lie too close to each other, which would encourage the algorithm to erroneously fit them in a short segment of low variance. We write .

We are now in a position to prove the following lemmata:

Lemma 1

(Yao 1988) , for some constant .

Lemma 2

, , , , , and for some constants ,,, , , and .

Lemma 3

There exists a constant such that holds on for all .

Lemma 4

Let be such that there exists some such that . The following holds given :

Lemma 5

Let be such that such that or . The following then holds given

Lemma 6

Let for some partition of such that such that . Then,

where holds on for large enough .

Lemma 7

For all , there exists a constant decreasing in such that holds on if and for all .

We now define , noting it decreases in , and the set of partitions

which are within of the true partition.

We will show that, for large enough , the optimal partition lies in given the event set . Given the probability of , this proves Theorem 1. Our approach will consist of showing that the cost of a partition is higher than that of the true partition with the true parameters (see Proposition 4). We will achieve this by adding free changes to , thus splitting up the series into multiple sub-segments, each containing a single true changepoint and points on either side of it. This also defines a projection of onto the partitions of the sub-segments. We define the set of partitions

for segments for which there exists a such that: as an analogue of for the whole of .

If , there must be at least one sub-segment for which the projection of does not lie in . We will show in Proposition 3 that the cost of the true partition using the true parameters is at least lower than that of the projection of on such a segment. We will also show in Proposition 2 that the projections of which are in have a cost which is at most lower than that of the true partition with true parameters.

Proposition 2

Let , be such that there exists a such that: , then there exists a constant such that given ,

for all valid partitions of the form , if is large enough.

Proof of Proposition 2: The following cases are possible:

Case 1: . Then:

where the inequality follows from Lemma 4.

Case 2: . Then:

where the inequality follows from Lemmata 4 and 5.

Case 3: . Then:

where the first inequality follows from the fact that introducing an unpenalised changepoint reduces cost and the second is a consequence of Lemma 4.

Case 4: . Symmetrical to case 2.

Case 5: . Symmetrical to case 3.

This finishes our proof.

Proposition 3

There exists a constant , such that for which such that