Subset Multivariate Collective And Point Anomaly Detection

# Subset Multivariate Collective And Point Anomaly Detection

Alexander Fisch, Idris Eckley, and Paul Fearnhead
Lancaster University, United Kingdom
###### Abstract

In recent years, there has been a growing interest in identifying anomalous structure within multivariate data streams. We consider the problem of detecting collective anomalies, corresponding to intervals where one or more of the data streams behaves anomalously. We first develop a test for a single collective anomaly that has power to simultaneously detect anomalies that are either rare, that is affecting few data streams, or common. We then show how to detect multiple anomalies in a way that is computationally efficient but avoids the approximations inherent in binary segmentation-like approaches. This approach, which we call MVCAPA, is shown to consistently estimate the number and location of the collective anomalies – a property that has not previously been shown for competing methods. MVCAPA can be made robust to point anomalies and can allow for the anomalies to be imperfectly aligned. We show the practical usefulness of allowing for imperfect alignments through a resulting increase in power to detect regions of copy number variation.

Keywords: Epidemic Changepoints, Copy Number Variations, Dynamic Programming, Outliers, Robust Statistics.

## 1 Introduction

In this article, we consider the challenge of estimating the location of both collective and point anomalies within a multivariate data sequence. The field of anomaly detection has attracted considerable attention in recent years, in part due to an increasing need to automatically process large volumes of data gathered without human intervention. See chandola2009anomaly and pimentel2014review for comprehensive reviews of the area.

chandola2009anomaly categorises anomalies into one of three categories: global, contextual, or collective. The first two of these categories are point anomalies, i.e. single observations which are anomalous with respect to the global, or local, data context respectively. Conversely, a collective anomaly is defined as a sequence of observations which together form an anomalous pattern.

In this article, we focus on the following setting: we observe a multivariate time series corresponding to observations observed across different components. Each component of the series has a typical behaviour, interspersed by time windows where it behaves anomalously. In line with the definition in chandola2009anomaly , we call the behaviour within such a time window a collective anomaly. Often the underlying cause of such a collective anomaly will affect more than one, but not necessarily all, of the components. Our aim is to accurately estimate the location of these collective anomalies within the multivariate series, potentially in the presence of point anomalies. Examples of applications in which this class of anomalies is of interest include but are not limited to genetic data (bardwell2017bayesian, ; jeng2012simultaneous, ) and brain imaging data (Epidemic:fmri:Amoc, ; Epidemic:Aston, ). In Genetics, it is of interest to detect regions of the genome containing an unusual copy number. Such genetic variations have been linked to a range of diseases including cancer (diskin2009copy, ). To detect these variations, a copy number log-ratio statistics is obtained for all locations along the genome. Segments in which the mean significantly deviates from the typical mean, 0, are deemed variations (bardwell2017bayesian, ). Jointly analysing the data of multiple individuals can allow for the detection of shared and even weaker variations (jeng2012simultaneous, ). In brain data analysis, sudden shocks can lead to certain parts of the brain exhibiting anomalous activity (Epidemic:Aston, ) before returning to the baseline level.

Whilst it may be mathematically convenient to assume that anomalous structure occurs contemporaneously within the multivariate sequence, in practice one might expect some time delays (i.e. offsets or lags), as illustrated by Figure 1. In this article we will consider two different scenarios for the alignment of related collective anomalies across different components. The first, idealised setting, is that concurrent collective anomalies perfectly align. That is we can segment our time series into windows of typical and anomalous behaviour. For each anomalous window the data from a subset of components will be collective anomalies. For some applications however, it is more realistic to assume that concurrent collective anomalies start and end at similar but not necessarily identical time points – the second setting considered in this paper.

Current approaches aimed at detecting collective anomalies can broadly be divided into state space approaches and (epidemic) changepoint methods. State space models assume that a hidden state, which evolves following a Markov chain, determines whether the time series’ behaviour is typical or anomalous. Examples of state space models for anomaly detection can be found in bardwell2017bayesian and smyth1994markov . These models have the advantage of providing very interpretable output in the form of probabilities of certain segments being anomalous. However, they are often slow to fit and typically require prior parameters which can be difficult to specify.

The epidemic changepoint model provides an alternative detection framework, built on an assumption that there is a typical behaviour from which the model deviates during certain windows. Early contributions in this area include the work of levin1985cusum . Each epidemic changepoint can be modelled as two classical changepoints, one away from and one returning back to the typical distribution. Epidemic changepoints can therefore be inferred by using classical changepoint methods such as PELT (killick2012optimal, ), Binary Segmentation (scott1974cluster, ), or Wild Binary Segmentation (fryzlewicz2014wild, ) if the data is univariate and Inspect (wang2018high, ) or SMOP (pickering2016changepoint, ) if the data is multivariate. We note, however, that this approach can lead to sub-optimal power, as it is unable to exploit the fact that the typical parameter is shared across the non-anomalous segments (CAPApaper, ).

Many epidemic changepoint methods are based on the popular circular binary segmentation (CBS) by olshen2004circular , an epidemic version of binary segmentation. A cost-function based approach for univariate data, CAPA, was introduced by CAPApaper . However, within a multivariate setting, the theoretically detectable changes can potentially be very sparse, with a few components exhibiting strongly anomalous behaviour. Alternatively they might also be dense, i.e. with a large proportion of components exhibiting potentially very weak, anomalous behaviour (cai2011optimal, ; jeng2012simultaneous, ). A range of CBS-derived methods for detecting epidemic changes has therefore been proposed such as Multisample Circular Binary Segmentation (zhang2010detecting, ) for dense changes, LRS (jeng2010optimal, ) for sparse changes, and higher criticism (donoho2004higher, ) based methods like PASS (jeng2012simultaneous, ) for both types of changes.

This article makes three main contributions, the first of which is the derivation of a moment function based test for the presence of subset multivariate collective anomalies with finite sample guarantees on false positives and power against both dense and sparse changes. The second contribution is the introduction of a new algorithm Multi-Variate Collective And Point Anomalies (MVCAPA), capable of detecting both point anomalies and potentially lagged collective anomalies in a computationally efficient manner. Finally, the article provides finite sample consistency results for MVCAPA under certain settings which are independent of the number of collective anomalies .

The article is organised as follows: We begin by introducing our modelling framework for (potentially lagged) multivariate collective anomalies in Section 2.1 and define the penalised negative saving statistic to treat the restricted case of at most one single anomalous segment without lags. Section 2.2 introduces penalty regimes, prior to examining their power in Section 2.3. We then turn to consider the problem of inferring multiple collective anomalies as well as point anomalies in Section 3, before discussing computational aspects in Section 4, and establishing finite sample consistency results in Section 5. We extend penalised saving statistic to point anomalies in Section 6 and compare MVCAPA to other approaches in Section 7. We conclude the article by applying MVCAPA to CNV data in Section 8. All proofs can be found in the supplementary material at the end of this paper. MVCAPA has been implemented in the R package anomaly which is available from https://github.com/Fisch-Alex/anomaly.

## 2 Model and Inference for a Single Collective Anomaly

### 2.1 Penalised Cost Approach

We begin by focussing on the case where collective anomalies are perfectly aligned. We consider a -dimensional data source for which time-indexed observations are available. A general model for this situation is to assume that the observation , where and index time and components respectively, has probability density function and that the parameter, , depends on whether the observation is associated with a period of typical behaviour or an anomalous window. We let denote the parameter associated with component during its typical behaviour. Let be the number of anomalous windows, with the -th window starting at and ending at and affecting the subset of components denoted by the set . We assume that anomalous windows do not overlap, so for . If collective anomalies of interest are assumed to be of length at least , we also impose . Our model then assumes that the parameter associated with observation is

 θ(i)(t)=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩θ(i)1ifs1

We start by considering the case where there is at most one collective anomaly, i.e. where , and introduce a test statistic to determine whether a collective anomaly is present and, if so, when it occurred and which components were affected. The methodology will be generalised to multiple collective anomalies in Section 3. Throughout we assume that the parameter, , representing the sequence’s baseline structure, is known. If this is not the case, an approximation can be obtained by estimating robustly over the whole data, as in CAPApaper .

Given the start and end of a window, , and the set of components involved, J, we can calculate the log-likelihood ratio statistic for the collective anomaly. To simplify notation we introduce a cost,

 Ci(x(i)s+1:e,θ)=−2e∑t=s+1logf(x(i)t,θ),

defined as minus twice the log-likelihood of data for parameter . We can then quantify the saving obtained by fitting component as anomalous for the window starting at and ending at as

 Si(s,e)=Ci(x(i)(s+1):e,θ(i)0)−minθ(Ci(x(i)(s+1):e,θ)).

The log-likelihood ratio statistic is then . As the start or the end of the window, or the set of components affected, are unknown a priori, it is natural to maximise the log-likelihood ratio statistic over the range of possible values for these quantities. However, in doing so we need to take account of the fact that different J will allow different numbers of components to be anomalous, and hence will allow maximising the log-likelihood, or equivalently minimising the cost, over differing numbers of parameters. This suggests penalising the log-likelihood ratio statistic differently, depending on the size of J. That is we calculate

 maxJ,s≤e−l⎡⎣∑i∈JSi(s,e)−P(|J|)⎤⎦, (2)

where is a suitable positive penalty function of the number of components that change, and is the minimum segment length. We will detect a change if (2) is positive, and estimate its location and the set of components that are anomalous based on the values of , , and J that give the maximum of (2). Whilst we have motivated this procedure based on a log-likelihood ratio statistic, the idea can be applied with more general cost functions. For example if we believe anomalies result in a change in mean then we could use cost functions based on quadratic loss. Suitable choices for the penalty function, , will be discussed in Section 2.2.

To efficiently maximise (2),define positive constants , with , and, for , . So the s are the first differences of our penalty function . Further, let the order statistics of be

 S(1)(s,e)≥...≥S(p)(s,e),

and define the penalised saving statistic of the segment ,

Then (2) is obtained by maximising over .

The choice of and has a significant impact on the performance of the statistic. For example, if we set for , then our approach will assume all components are anomalous in any anomalous segment. Clearly, and are only well specified up to their sum and can be absorbed into without altering the properties of our statistic. However, not doing so can have computational advantages. Indeed, it removes the need of sorting when all the s are identical, which can be appropriate in certain settings discussed in the next section.

### 2.2 Choosing Appropriate Penalties

We now turn to the problem of choosing appropriate penalties for the methodology introduced in the previous section. Penalties are typically chosen in a way which controls false positives under the null hypothesis that no anomalous segments are present. However, the setting in which the penalty is to be used can also play an important role. In particular, it is typically required that under the null hypothesis

 P(^K=0)≥1−Ae−ψ(p,n), (3)

where is a constant and increases with and/or , depending on the setting. For example, in panel data (bardwell2019most, ) or brain imaging data (Epidemic:Aston, ), the number of time points may be small but we may have data from a large number of variates, . Setting is therefore a natural choice so that the false positive probability tends to 0 as increases. In a streaming data context, the number of sampled components is typically fixed, while the number of observations increases. Setting is then a natural choice. In an application where both and are large, or can increase, setting , would be a natural choice.

Our approach to constructing appropriate penalty regimes is to construct a bound on the probability that under the null separately for all and all windows . We will derive explicit results for the case in which collective anomalies are characterised by an anomalous mean – arguably one of the most common settings in applications – and point to generalisations where appropriate. As mentioned in the previous section, we assume throughout this section that the parameter of the typical distribution is known.

Assume without loss of generality that the typical mean is equal to 0 and the typical standard deviation is equal to 1. Under the the null model, which we denote , we therefore have that

 x(c)ti.i.d.∼ϵt,c

where is Gaussian white noise. Denote the mean for data sequence within a window as . The cost function obtained if we model the data in this way is the residual sum of squares. Under this cost function, for any ,

 Sm(s,e)=e∑t=s+1((x(m)t)2−(x(m)t−¯x(m)s+1:e)2)=(e−s)(¯x(m)(s+1):e)2.

Furthermore, under , this saving is -distributed.

We present three distinct penalty regimes that give valid choices for the penalties for this setting, each with different properties. All penalty regimes will be indexed by a parameter which corresponds to the exponent of the probability bound, as in (3). A first strategy for defining penalties consists of just using a single global penalty.

Penalty Regime 1: and .

The following result based on standard tail bounds of the distribution shows that this penalty regime controls the overall type 1 error rate at the desired level

###### Proposition 1.

Let x follow and let denote the number of inferred collective anomalies under penalty regime 1. Then, there exists a constant such that .

Under this penalty, we would infer that any detected anomaly region will affect all components. This is inappropriate, and is likely to lead to a lack of power, if we have anomalous regions that only affect a small number of components. For this type of anomalies, the following regime offers a good alternative as it typically penalises fitting collective anomalies with few components substantially less:

Penalty Regime 2: and , for .

The following results based on tail bounds for the -distribution shows that this penalty controls false positives:

###### Proposition 2.

Let x follow and let denote the number of inferred collective anomalies under penalty regime 2. Then, there exists a constant such that .

Comparing penalty regime 2 with penalty regime 1 we see that it has a lower penalty for small , but a much higher penalty for . As such it has higher power against collective anomalies affecting few components, but low power if the collective anomalies affect most components. Taking the point-wise minimum between both penalties therefore provides a balanced approach, providing power against both types of changes (see Figure 2.) Moreover, this approach can be generalised to settings other than the change in mean case provided the distribution of the savings is sub-exponential under the null distribution. Indeed, the first penalty regime is derived from a Bernstein bound and the second one from an exponential Chernoff bound on the tail.

For the special case considered here, in which the savings follow a distribution under the null hypothesis a third penalty regime can be derived:

Penalty Regime 3: , where is the PDF of the distribution and is defined via the implicit equation .

It satisfies the following proposition regarding false positive control

###### Proposition 3.

Let x follow and let denote the number of inferred collective anomalies under penalty regime 3. Then, there exists a constant such that .

Moreover, as can be seen from Figure 2, this penalty regime provides a good alternative to the other penalty regimes, especially in intermediate cases. It can be generalised to other distributions provided quantiles of can be estimated and exponential bounds for can be derived under the null hypothesis.

In practice, to maximise power, we choose so that the resulting penalty function , is the point-wise minimum of the penalty functions , , and resulting from using penalty regimes 1, 2, and 3 respectively. We call this the composite regime. It is a corollary from Propositions 1, 2, and 3, that this composite penalty regime achieves when x follows . It should be noted that the term comes from a Bonferroni correction over all possible start and end points. Fixing a maximum segment length therefore reduces the to .

Whilst the above propositions give guidance regarding the shape of the penalty function as well as a finite sample bound on the probability of false positives they do not give an exact false positive rate. A user specified rate can nevertheless be obtained by scaling the penalties and by a constant. This single constant can then be tuned using simulations on data exhibiting no anomalies.

### 2.3 Results on Power

We now examine the power of the penalised saving statistic at detecting an anomalous segment when using the thresholds defined in the previous section. In particular, we examine its behaviour under a fixed and large regime. It is well known (jeng2012simultaneous, ; donoho2004higher, ; cai2011optimal, ), that different regimes determining the detectability of collective anomalies apply in this setting depending on the proportion of affected components. We follow the asymptotic parametrisation of jeng2012simultaneous and therefore assume that

 x(i)t=v(i)μ+ϵ(i)t,v(i)∼{0with prob.1−p−ξ,1with prob.p−ξ,andϵ(i)t∼N(0,1),fors

Typically (jeng2012simultaneous, ), changes are characterised as either sparse or dense. In a sparse change, only a few components are affected. Such changes can be detected based on the saving of those few components being larger than expected after accounting for multiple testing. The affected components therefore have to experience strong changes to be reliably detectable. On the other hand, a dense change is a change in which a large proportion of components exhibits anomalous behaviour. A well defined boundary between the two cases exist with corresponding to dense and corresponding to sparse changes (jeng2012simultaneous, ; enikeeva2013high, ). The changes affecting the individual components can consequently be weaker and still be detected by averaging over the affected components. Depending on the setting, the change in mean is parametrised by in the following manner:

 (e−s)μ2={2rplog(p)12<ξ<1,p−2rp0≤ξ≤12.

Both jeng2012simultaneous and cai2011optimal derive detection boundaries for , separating changes that are too weak to be detected from those changes strong enough to be detected. For the case in which the standard deviation in the anomalous segment is the same as the typical standard deviation, the detectability boundaries correspond to

 ρ−=⎧⎪⎨⎪⎩(1−√1−ξ)234<ξ<1,ξ−1212<ξ≤34,

for the sparse case () and

 ρ+=(12−ξ)

for the dense case (). The following proposition establishes that the penalised saving statistic has power against all sparse changes within the detection boundary, as well as power against most dense changes within the detection boundary

###### Proposition 4.

Let the series contain an anomalous segment , which follows the model specified in equation 4. Let if or if . Then the number of collective anomalies, , estimated by MVCAPA using the composite penalty regime with on the data , satisfies

 P(^K≠0)→1asp→∞.

Whilst the above assumed to be fixed, the result also holds if . Moreover, rather than requiring to be 0, or a common value , it is trivial to extend the result to the case where are i.i.d. random variables whose magnitude exceeds with probability . It is worth noticing that the third penalty regime is required to obtain optimal power against the intermediate sparse setting .

## 3 Inference for Multiple Anomalies

### 3.1 Inference for K Collective Anomalies and Point Anomalies

We now turn to the problem of generalising the methodology introduced in Section 2.1 to infer multiple collective anomalies. We will then borrow methodology from CAPApaper to incorporate point anomalies within the inference. A natural way of extending the methodology introduced in Section 2.1 to infer multiple collective anomalies in various ways, is to maximise the penalised saving jointly over the number and location of potentially multiple anomalous windows. That is we infer , ,…, by directly maximising

 ^K∑k=1S(^sk,^ek), (5)

subject to and .

It is well know from the literature that many methods designed to detect changes, or collective anomalies, are vulnerable to point anomalies (fearnhead2019changepoint, ; CAPApaper, ). Distinguishing between point and collective anomalies only makes sense if they are different, that is collective anomalies are longer than length 1. Under such an assumption, our approach can be extended to model both point and collective anomalies.

Borrowing the approach of CAPApaper , a point anomaly can be modelled as an epidemic changepoint of length 1 occurring during a segment of typical behaviour. Joint inference on collective and point anomalies can then be performed by maximising the penalised saving

 ^K∑k=1S(^sk,^ek)+∑t∈OS′(xt), (6)

with respect to , ,…, , and the set of point anomalies , subject to , . Here, is the saving of introducing a point anomaly.

For example, when collective anomalies are characterised by changes in mean

 S′(xt)=p∑i=1max⎛⎜⎝⎛⎝x(i)t−μ(i)0σ(i)0⎞⎠2−β′,0⎞⎟⎠.

The penalised savings and , as we assume point anomalies to be sparse. Suitable choices for will be discussed in the next subsection.

### 3.2 Penalties for Point Anomalies

Penalties for point anomalies can be chosen with the aim of controlling false positives under the null hypothesis, that no collective or point anomalies are present. When collective anomalies are characterised by a change in mean the null hypothesis is identical to that in Section 2.2. The following proposition holds for any penalty :

###### Proposition 5.

Let hold and denote the set of point anomalies inferred by MVCAPA using as penalty for point anomalies. Then, there exists a constant such that

 P(^O=∅)≥1−Anpe−12β′.

This suggests setting , where is as in Section 2.2.

## 4 Computation

We now turn to the problem of maximising the penalised saving introduced in the previous section. The standard approach to extend a method for detecting an anomalous window to detecting multiple anomalous windows is through circular binary segmentation (CBS, olshen2004circular ) – which repeatedly applies the method for detecting a single anomalous window or point anomaly. Such an approach is equivalent to using a greedy algorithm to approximately maximise the penalised saving and has computational cost of , where is the maximal length of collective anomalies and is the number of observations. Consequently, the runtime of CBS is if no restriction is placed on the length of collective anomalies. We will show in this section that we can directly maximise the penalised saving by using a pruned dynamic programme. This enables us to jointly estimate the anomalous windows, at the same or at a lower computational cost than CBS.

1. Only collective anomalies. The penalised saving defined in (5) can be maximised exactly using a dynamic programme. Indeed, defining to be the largest penalised saving for all observations up to, and including, the time , the following relationship holds:

 C(m)=max(C(m−1),max0≤t≤m−l(C(t)+S(t,m))).

It should be noted that calculating is, on average, an operation, since it requires sorting the savings made from introducing a change in each component. This sorting is not required when all are identical. Setting all to the same value, as in penalty regime 1, therefore reduces the computational cost to . For a maximum segment length , the computational cost of this dynamic programme approach scales like . If no maximum segment length is specified, it therefore scales quadratically in . In this setting, it therefore achieves the same run-time as CBS in both cases.

2. Collective and point anomalies. The saving in (6) can be minimised exactly via a slight modification of the previous dynamic programme. Indeed, writing for the largest penalised saving of all observations up to and including time , the relationship

 C(m)=max(C(m−1),max0≤t≤m−l(C(t)+S(t,m)),C(m−1)+S′(xt))

holds for . The previous observations regarding the computational complexity in and remain valid.

3. Pruning the dynamic programme. Solving the whole dynamic programme if no maximum segment length is specified has a computational cost increasing quadratically in . However, the solution space of the dynamic programme can be pruned in a fashion similar to killick2012optimal and CAPApaper to reduced this computational cost. Indeed, the following proposition holds:

###### Proposition 6.

Let the costs be such that

 minθ(c∑t=a+1Ci(xt,θ))≥minθ(b∑t=a+1Ci(xt,θ))+minθ(c∑t=b+1Ci(xt,θ))

holds for all x and such that and . Then, if for some there exists an such that

 C(m)−α−p∑1βi>C(t)+S(t,m),

then, for all ,

 C(m′)>C(t)+S(t,m′).

A wide range of cost functions, such as the negative log-likelihood and the sum of squares satisfy the condition required by the above proposition. The proposition implies that if for some there exists an such that

 C(m)−α−p∑1βi>C(t)+S(t,m)

holds, can be dropped as an option from the dynamic programme for all steps after step , thus reducing the cost of the algorithm. As a result of this pruning we found the runtime of MVCAPA to be close to linear in , when the number of collective anomalies increased linearly with .

## 5 Accuracy of Detecting and Locating Multiple Collective Anomalies

Whilst we have shown our method has good properties when detecting a single anomalous window, it is natural to ask whether the extension to detecting multiple anomalous windows will be able to consistently infer the number of anomalous windows and accurately estimate their locations. Developing such results for the joint detection of sparse and dense collective anomalies is notoriously challenging, as can be seen from the fact that previous work on this problem (jeng2012simultaneous, ) has not provided any such results. Another new feature of this proof is that the results allow for the number of anomalous segments to increase, whereas most results in the related changepoint literature (e.g. fryzlewicz2014wild ) assume to be fixed.

Consider a multivariate sequence , which is of the form where the mean follows a subset multivariate epidemic changepoint model with epidemic changepoints in mean. For simplicity, we assume that within an anomalous window all affected components experience the same change in mean, and that the noise process is i.i.d. Gaussian although the results extend to sub-Gaussian noise, i.e.

 μ(i)(t)=⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩μ1ifs1

Consider also the following choice of penalty, which is very similar to the the point-wise minimum between penalty regimes 1 and 2:

 m∑i=1βi={Cψ+Cmlog(p)ifm≤k∗,p+Cψ+C√pψifm>k∗. (8)

Here, is a constant, sets the rate of convergence and the threshold

 k∗=p1/2ψlog(p),

is defined as the threshold separating sparse changes from dense changes.

Anomalous regions can be easier or harder to detect depending on the strength of the change in mean characterising them and the number of components they affect. This intuition can be quantified by

 △2k=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩μ2klog(p)+ψ|Jk|−1if|Jk|≤k∗,μ2k√pψ|Jk|−1+ψ|Jk|−1if|Jk|>k∗,

which we define to be the signal strength of the th anomalous region. The following consistency result then holds

###### Theorem 1.

There exists a global constant such that the inferred partition
obtained by applying MVCAPA using the penalty regime specified in (8) on data x which follows the distribution specified in (7) satisfies

 P(^K=K,|^sk−sk|<10C△2k,|^ek−ek|<10C△2k)>1−An3e−ψ, (9)

provided that

 ek−sk≥40C△2k,sk+1−ek≥40C△2k,sk−ek−1≥40C△2k

holds for .

The result is proved in the appendix. This finite sample result holds for a fixed , which is independent of , , , and/or . When , the threshold is identical to that in jeng2012simultaneous . However, if is chosen to increase with , so will . This formalises the intuition that when , all changes are in some sense sparse.

## 6 Incorporating Lags

So far, we have assumed that the collective anomalies were characterised by the model specified in (1), which assumes all anomalous windows are perfectly aligned. In some applications, such as the vibrations recorded by seismographs at different locations, certain components will start exhibiting atypical behaviour later and/or return to the typical behaviour earlier. An example can be found in Figure 0(b). It is possible to extend the model in (1) to allow for this behaviour by allowing lags in the start or end of each window:

 θ(i)(t)=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩θ(i)1ifs1+d(i)1

Here the start and end lag of the th component during the th anomalous window are denoted, respectively, by and , for some maximum lag-size, , and satisfy . The remaining notation is as before.

We can extend our penalised likelihood/penalised cost approach to this setting. We begin by extending the test statistic defined in Section 2.1 and the inference procedure in Section 3 to allow for lags of up to , before discussing modifications to the penalties. We conclude this section by introducing ways of making the method computationally efficient, leading to a computational cost increasing only linearly in .

### 6.1 Extending the Test Statistic

The statistic introduced in Section 2.1 can easily be extended to incorporate lags. The only modification this requires is to re-define the saving to be

 max0≤d(i),f(i)≤we−s−d(i)−f(i)≥l[Ci(x(i)(s+1+d(i)):(e−f(i)),θ(i)0)−minθ(Ci(% x(i)(s+1+d(i)):(e−f(i)),θ))−γ(I(d(i)≠0∨f(i)≠0))],

where is the maximal allowed lag, and is a penalty for adding a lag. We then infer , , ,…, by directly maximising the penalised saving

 ^K∑k=1S(^sk,^ek)+∑t∈OS′(xt), (11)

with respect to , ,…, , and the set of point anomalies , subject to , and .

### 6.2 Modifying the Penalties

As discussed in Section 2.2, the main purpose of the penalties is to account for multiple testing. Introducing lags means searching over more possible start and end points, i.e. testing more hypotheses. Consequently, increased penalties are required. The following modified version of penalty regime 2 can be used to account for lags:

Penalty Regime 2’: and , for .

Here is a (small) constant. The following proposition shows that penalty regime 2’ controls false positives at the desired level.

###### Proposition 7.

Let x follow and let denote the number of collective anomalies inferred by MVCAPA under regime 2’. Then, there exists a constant such that

 P{^K=0}≥1−n2exp(A′2(1+ϵϵ)3)e−ψ

An alternative to this penalty regime consists of using penalty regime 2, but setting the penalty .

Unlike penalty regime 2, which is based on a tail bound, regimes 1 and 3 are based on Bernstein-type mean bounds. However, the MGF of the maximum of correlated chi-squared distributions is not analytically tractable. Consequently we limited ourselves to extending regime 2.

### 6.3 Computational Considerations

The dynamic programming approach described in Section 4 can also be used to minimise the penalised negative saving in Equation (11). Solving the dynamic programme requires the computation of for for all permissible at each step of the dynamic programme. Computing these savings ex nihilo every time leads to the computational cost of the dynamic programme to scale quadratically in .

However, it is possible to reduce the computational cost of including lags by storing the savings

 Ci(x(i)(a+1):b,θ(i)0)−minθ[Ci(x(i)(a+1):b,θ)]

for and . These can then be updated in each step of the dynamic programme at a cost of at most . From these, it is possible to calculate all required for a step of the dynamic programme in just comparisons. This reduces the computational cost of each step of the dynamic programme to or , depending on the type of penalty used. Crucially, only the comparatively cheap operations of allocating memory and finding the maximum of two numbers now increase with and even this relationship is only linear.

### 6.4 Pruning the Dynamic Programme

Even when lags are included in the model, the solution space of the dynamic programme can still be pruned in a fashion similar to killick2012optimal and CAPApaper . Indeed, the following generalisation of Proposition 6 holds:

###### Proposition 8.

Let the costs be such that

 minθ(c∑t=a+1Ci(xt,θ))≥minθ(b∑t=a+1Ci(xt,θ))+minθ(c∑t=b+1Ci(xt,θ))

holds for all x and such that and . Then, if for some there exists an such that

 C(m)−α−p∑1βi>C(t)+S(t,m)

holds,

 C(m′)>C(t)+S(t,m′)

must also holds for all .

## 7 Simulation Study

We now compare the performance of MVCAPA to that of other popular methods. In particular, we compare ROC curves, precision, as well as the runtime with PASS (jeng2012simultaneous, ) and Inspect (wang2018high, ; wang2016inspectchangepoint, ). PASS (jeng2012simultaneous, ) uses higher criticism in conjunction with circular binary segmentation (olshen2004circular, ) to detect subset multivariate epidemic changepoints. Code is available from the author’s website. Inspect (wang2018high, ) uses projections to find sparse classical changepoints.

The comparison was carried out on simulated multivariate time series with observations for components with i.i.d.  noise, for a range of values of . To these, collective anomalies affecting components occurring at a geometric rate of 0.001 (leading to an average of about 5 collective anomalies per series) were added. The lengths of these collective anomalies are i.i.d. Poisson-distributed with mean 20. Within a collective anomaly, the start and end lags of each component are drawn uniformly from the set , subject to their sum being less than the length of the collective anomaly. Note that implies the absence of lags. The means of the components during the collective anomaly are drawn from an -distribution. In particular we considered the following cases, emulating different detectable regimes introduced in Section 2.3.

1. The most sparse regime possible: a single component affected by a strong anomaly without lags, i.e. , , and .

2. The most dense regime possible: all components affected by weak anomalies without lags, i.e. , , and .

3. A regime close to the boundary between sparse and dense changes, i.e.  when and when with and .

4. A regime close to the boundary between sparse and dense changes, but with lagged collective anomalies, i.e. the same as 3 but with .

This analysis was repeated with 5 point anomalies distributed . The -scaling of the variance ensures that the point anomalies are anomalous even after correcting for multiple testing over the different components.

### 7.1 ROC Curves

We obtained ROC curves by varying the threshold parameters of Inspect and PASS and by rescaling for MVCAPA. The curves were obtained over 1000 simulated datasets. For MVCAPA, we typically set , but also tried and for the third and fourth setting. We used and rescaled the composite penalty regime (Section 2.2) for and penalty regime 2’ for . We also set the maximum segment lengths for both MVCAPA and PASS to 100 and the minimum segment length of MVCAPA to 2. The parameter of PASS, which excludes the lowest -values from the higher criticism statistic to obtain a better finite sample performance (see jeng2012simultaneous ) was set to or 5, whichever was the smallest. For MVCAPA and PASS, we considered a detected segment to be a true positive if its start and end point both lie within 20 observations of that of a true collective anomalies’ start and end point respectively. For Inspect, we considered a detected change to be a true positive if it was within 20 observation of a true start or end point. When point anomalies were added to the data, we considered segments of length one returned by PASS to be point anomalies to make the comparison with MVCAPA fairer.

The results for three of settings considered can be found in Figures 3 to 5. The qualitatively similar results for the second setting can be found in the supplementary material. We can see that Inspect usually does worst, especially when changes become dense, which is no surprise given the method was introduced to detect sparse changes. We additionally see that MVCAPA generally outperforms PASS. This advantage is particularly pronounced in the case in which exactly one component changes. This is a setting which PASS has difficulties dealing with due to the convergence properties of the higher criticism statistic at the lower tail (jeng2012simultaneous, ). PASS outperformed MVCAPA in the second setting for , when it was assisted by a large value of , which considerably reduced the number candidate collective anomalies it had to consider.

Figures 4 and 5, show that MVCAPA performs best when the correct maximal lag is specified. They also demonstrate that specifying a lag and therefore overestimating the lag when no lag is present adversely affects performance of MVCAPA. However, when lags are present, over-estimating the maximal lag appears preferable to underestimating it. Finally, the comparison between Figures 4(c) and 4(d) shows that the performance of MVCAPA is hardly affected by the presence of point anomalies, unlike that of Inspect and, to a lesser extent, PASS, whose performance is adversely affected.

### 7.2 Precision

We compared the precision of the three methods by measuring the accuracy (in mean absolute distance) of true positives. Only true positives detected by all methods were taken into account to avoid selection bias. We used the default parameters for MVCAPA and PASS, whilst we set the threshold for Inspect to a value leading to comparable number of true and false positives. To ensure a suitable number of true positives for Inspect we doubled sigma in the second scenario. The results of this analysis can be found in Figure 6 and show that MVCAPA is usually the most precise approach, exhibiting a significant gain in accuracy against PASS. When comparing the influence of the user-specified maximal lag, we note that specifying he correct maximal lag gives the best performance. We also note that over- and under- estimating the lag have a similar effects on the accuracy.

### 7.3 Runtime

We compared the scaling of the runtime of MVCAPA, PASS, and Inspect in both the number of observations , as well as the number of components . To evaluate the scaling in we set and varied on data without any anomalies. We repeated this analysis with collective anomalies appearing (on average) every 100 observations. The results of these two analyses can be found in Figures 6(a) and 6(b) respectively. We note that the slope of MVCAPA is very close to 2, in the anomaly-free setting and very close to 1 in the setting in which the number of anomalies increases linearly with the number of observations, suggesting quadratic and linear behaviour respectively, whilst the slopes of PASS and Inspect are close to 2 and 1 respectively in both cases.

Turning to the scaling of the three methods in , we set = 100 and varied . The results of this analysis can be found in Figure 6(c). We note that the slopes of all methods are close to 1 suggesting linear behaviour. However, Inspect becomes very slow once exceeds a certain threshold.

## 8 Application

We now apply MVCAPA to extract copy number variations (CNVs) from genetics data. The data consists of a log-likelihood ratio statistic evaluated along the genome which measures the likelihood of a CNV. A multivariate approach to detecting CNVs is attractive because they are often shared across individuals. By borrowing signal across individuals we should gain power for detecting CNVs which have a weak signal. However, as we will become apparent from our results, shared variations do not always align perfectly.

In this section we re-use the design of bardwell2017bayesian to compare MVCAPA with PASS. The only difference is that we set the maximum segment length for MVCAPA and PASS to 100, whilst bardwell2017bayesian used 200. To investigate the potential benefit of allowing for lags, we repeated the experiment for MVCAPA both with (i.e. not allowing for lags) and . Since in this application, we used the sparse penalty setting for MVCAPA.

The exact ground truth is unknown. Indeed, it is beyond the scope of this paper to differentiate between a false positive and a currently unknown CNV. We can nevertheless compare different methods by how accurately they detect known copy number variations for a given size. Like bardwell2017bayesian , we used known CNVs from the HapMap project (international2003international, ) as true positives and tuned the penalties and thresholds in such a way that 4% of the genome was flagged up as anomalous. For MVCAPA this involved scaling the penalties by a constant, as discussed in the final paragraph of Section 2.2.

The results of this analysis can be found in Figure 8. These tables show that MVCAPA compares favourably with PASS. We can also see that allowing for lags generally led to a better performance of MVCAPA, thus suggesting non-perfect alignment of CNVs across individuals. Moreover, MVCAPA was very fast taking 5 seconds to analyse the longer genome on a standard laptop when we did not allow lags, and 10 seconds when we allowed for lags. The R implementation of PASS, on the other hand, took 17 minutes.

## 9 Acknowledgements

This work was supported by EPSRC grant numbers EP/N031938/1 (StatScale) and EP/L015692/1 (STOR-i). The authors also acknowledge British Telecommunications plc (BT) for financial support, David Yearling and Kjeld Jensen in BT Research & Innovation for discussions. Thanks also to Lawrence Bardwell for providing us with data and code reproducing results from his paper.

## References

• [1] John AD Aston and Claudia Kirch. Evaluating stationarity via change-point alternatives with applications to fMRI data. The Annals of Applied Statistics, 6(4):1906–1948, 2012.
• [2] Lawrence Bardwell and Paul Fearnhead. Bayesian detection of abnormal segments in multiple time series. Bayesian Analysis, 12(1):193–218, 2017.
• [3] Lawrence Bardwell, Paul Fearnhead, Idris A Eckley, Simon Smith, and Martin Spott. Most recent changepoint detection in panel data. Technometrics, 61(1):88–98, 2019.
• [4] Stéphane Boucheron and Maud Thomas. Concentration inequalities for order statistics. Electronic Communications in Probability, 17, 2012.
• [5] Tony Cai, Jessie Jeng, and Jiashun Jin. Optimal detection of heterogeneous and heteroscedastic mixtures. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):629–662, 2011.
• [6] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR), 41(3):15, 2009.
• [7] International HapMap Consortium et al. The international hapmap project. Nature, 426(6968):789, 2003.
• [8] Sharon J Diskin, Cuiping Hou, Joseph T Glessner, Edward F Attiyeh, Marci Laudenslager, Kristopher Bosse, Kristina Cole, Yaël P Mossé, Andrew Wood, Jill E Lynch, et al. Copy number variation at 1q21. 1 associated with neuroblastoma. Nature, 459(7249):987, 2009.
• [9] David Donoho, Jiashun Jin, et al. Higher criticism for detecting sparse heterogeneous mixtures. The Annals of Statistics, 32(3):962–994, 2004.
• [10] Farida Enikeeva and Zaid Harchaoui. High-dimensional change-point detection with sparse alternatives. arXiv preprint arXiv:1312.1900, 2013.
• [11] Paul Fearnhead and Guillem Rigaill. Changepoint detection in the presence of outliers. Journal of the American Statistical Association, 114(525):169–183, 2019.
• [12] Alexander T M Fisch, Idris A Eckley, and Paul Fearnhead. A linear time method for the detection of point and collective anomalies. ArXiv e-prints, June 2018.
• [13] Piotr Fryzlewicz. Wild binary segmentation for multiple change-point detection. The Annals of Statistics, 42(6):2243–2281, 2014.
• [14] X Jessie Jeng, T Tony Cai, and Hongzhe Li. Optimal sparse segment identification with application in copy number variation analysis. Journal of the American Statistical Association, 105(491):1156–1166, 2010.
• [15] X Jessie Jeng, T Tony Cai, and Hongzhe Li. Simultaneous discovery of rare and common segment variants. Biometrika, 100(1):157–172, 2012.
• [16] Rebecca Killick, Paul Fearnhead, and Idris A Eckley. Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500):1590–1598, 2012.
• [17] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
• [18] Bruce Levin and Jennie Kline. The cusum test of homogeneity with an application in spontaneous abortion epidemiology. Statistics in Medicine, 4(4):469–488, 1985.
• [19] Adam B Olshen, E S Venkatraman, Robert Lucito, and Michael Wigler. Circular binary segmentation for the analysis of array-based dna copy number data. Biostatistics, 5(4):557–572, 2004.
• [20] Benjamin Pickering. Changepoint detection for acoustic sensing signals. PhD thesis, Lancaster University, 2016.
• [21] Marco A F Pimentel, David A Clifton, Lei Clifton, and Lionel Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
• [22] Lucy F Robinson, Tor D Wager, and Martin A Lindquist. Change point estimation in multi-subject fMRI studies. Neuroimage, 49(2):1581–1592, 2010.
• [23] Alastair J Scott and M Knott. A cluster analysis method for grouping means in the analysis of variance. Biometrics, pages 507–512, 1974.
• [24] Padhraic Smyth. Markov monitoring with unknown states. IEEE Journal on Selected Areas in Communications, 12(9):1600–1612, 1994.
• [25] Tengyao Wang and Richard J Samworth. Inspectchangepoint: high-dimensional changepoint estimation via sparse projection. R Package Version, 1, 2016.
• [26] Tengyao Wang and Richard J Samworth. High dimensional change point estimation via sparse projection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):57–83, 2018.
• [27] Nancy R Zhang, David O Siegmund, Hanlee Ji, and Jun Z Li. Detecting simultaneous changepoints in multiple sequences. Biometrika, 97(3):631–645, 2010.

## 10 Supplementary Material

### 10.1 Proofs for Theorems and Propositions

#### 10.1.1 Proof of Proposition 1

Let . The probability that the segment is not flagged up as anomalous is given by

 P(∑c∈Sm(j−i+1)(¯ηi:j)2

where inequality follows from the bounds on the chi-squared distribution in [17]. A Bonferroni correction over all possible pairs then finishes the proof.

#### 10.1.2 Proof of Proposition 2

Let . For this pair define , noting that they are all independent and -distributed. The probability that this segment will not be considered anomalous is

 P(∑c∈SmYc<2(1+ϵ)ψ+2mlog(p),∀Sm⊂{1,...,p}:|Sm|=m,1≤m≤p) ≥P(p∑i=1(Y