# Robust Retrospective Multiple Change-point Estimation for Multivariate Data

## Abstract

We propose a non-parametric statistical procedure for detecting multiple change-points in multidimensional signals. The method is based on a test statistic that generalizes the well-known Kruskal-Wallis procedure to the multivariate setting. The proposed approach does not require any knowledge about the distribution of the observations and is parameter-free. It is computationally efficient thanks to the use of dynamic programming and can also be applied when the number of change-points is unknown. The method is shown through simulations to be more robust than alternatives, particularly when faced with atypical distributions (e.g., with outliers), high noise levels and/or high-dimensional data.

Keywords: Change-point estimation, multivariate data, Kruskal-Wallis test, robust statistics, joint segmentation.

## 1 Introduction

Retrospective multiple change-point estimation consists in partitioning a non-stationary series of observations into several contiguous stationary segments of variable durations [1]. It is particularly appropriate for the a posteriori analysis of time series in which the quantity driving the behavior of the series jumps from one level to another at random instants called change-points. Such a task, also known as temporal signal segmentation in signal processing, arises in many applications, ranging from EEG to speech processing and network intrusion detection [2, 3].

The classic setting for change-point detection and estimation is the case where the sequence of observations is univariate and the use of the least-squares criterion is considered to be adequate, for instance, in the presence of Gaussian noise [4, 5]. When there are multiple change-points to be detected, the composite least-squares criterion can be efficiently optimized using a dynamic programming recursion whose computational cost scales like the square of the number of observations [6, 7, 8]. The approach can be generalized to multivariate settings [9].

In this work, we consider methods that can be applied to high-dimensional data without requiring
strong prior knowledge of the underlying distribution of the observations. To the best of our
knowledge, there are very few approaches in the literature that are suitable for this task. The
first option is the kernel-based approach of [10] which uses a general
kernel metric to evaluate distances between multivariate observations. This approach is powerful
but requires the choice of an appropriate kernel function and, furthermore, it is not robust in
situations where the noise level is very significant as will be illustrated in
Section 4. The other relevant reference is [11] which
combines univariate rank statistics using a weighting matrix derived from an asymptotic
analysis. This approach, which is based on the principle of the Mann-Whitney Wilcoxon
two-sample test, is however limited to the case where, at most, a single change-point is present in
the observed signal. Our contribution in this paper is to show that the test statistic of
[11] can be modified so as to deal with multiple change-points (in analogy
with the way the classical Kruskal-Wallis test generalizes the Mann-Whitney/Wilcoxon statistic to
multiple group testing [12]). Furthermore, we show that the proposed test statistic
is amenable to dynamic programming and can thus be optimized efficiently despite the combinatorial
nature of the multiple change-point detection task. We will show in particular that the proposed
method, termed *dynMKW* (for “dynamic programming multivariate Kruskal-Wallis”), is more
efficient than greedy strategies based on repeated use of single change-point tests as suggested in
[13].

When dealing with multiple change-points, one needs a criterion for estimating the number of change-points, which is an instance of a model selection problem. Traditional approaches include the use of a complexity penalty that is added to the test statistic [5, 7, 14] or the use of priors on the locations of change-points when adopting the Bayesian point of view; see, e.g., [15, 16] and references therein. An original approach, adopted in [17, 18], is to consider the penalty associated with the use of the $\ell_1$ norm (or block-$\ell_1$ norm in [18]). In this paper, we propose a simple construction which relies on the so-called “slope heuristic” and is a variant of the procedure described in [14]. Although heuristic, this approach is preferable to off-the-shelf Schwarz-like penalties (AIC, BIC, etc.), which are most often found to be badly calibrated for the type of problems considered here.

The paper is organized as follows. The statistical methodology for testing homogeneity in several groups of multivariate observations is described in Section 2. The procedure for estimating the change-points derived from this testing procedure is then described in Section 3. In Section 4, we report the results of numerical experiments carried out on synthetic and real data.

## 2 Testing homogeneity within several groups of multivariate data

Let $X_1,\dots,X_n$ be $K$-dimensional observed random vectors such that $X_i = (X_i^{(1)},\dots,X_i^{(K)})'$, where $A'$ denotes the transpose of the matrix $A$. We assume in the following that $X_1,\dots,X_n$ are independent. In this section, we first consider testing the hypothesis that $L$ given groups, $\{X_{n_0+1},\dots,X_{n_1}\}$, $\{X_{n_1+1},\dots,X_{n_2}\}$, …, $\{X_{n_{L-1}+1},\dots,X_{n_L}\}$, have the same distribution, where we will use the convention that $n_0 = 0$ and $n_L = n$.

For $k$ in $\{1,\dots,K\}$ and $i$ in $\{1,\dots,n\}$, let $R_i^{(k)}$ denote the rank of $X_i^{(k)}$ among $X_1^{(k)},\dots,X_n^{(k)}$, that is, $R_i^{(k)} = \sum_{j=1}^{n} \mathbb{1}\{X_j^{(k)} \le X_i^{(k)}\}$. Also define $\bar{R}_{n_{l-1}:n_l}$, for $l$ in $\{1,\dots,L\}$, by

$$\bar{R}_{n_{l-1}:n_l} = \frac{1}{n_l - n_{l-1}} \sum_{i=n_{l-1}+1}^{n_l} \left( R_i^{(1)} - \frac{n}{2},\; \dots,\; R_i^{(K)} - \frac{n}{2} \right)' .$$

We propose to use the following statistic:

$$T_n(n_1,\dots,n_{L-1}) = \frac{4}{n^2} \sum_{l=1}^{L} (n_l - n_{l-1})\, \bar{R}_{n_{l-1}:n_l}'\, \hat{\Sigma}_n^{-1}\, \bar{R}_{n_{l-1}:n_l} \,, \qquad (1)$$

where $\hat{\Sigma}_n$ is defined by

$$\hat{\Sigma}_n = \frac{4}{n^3} \sum_{i=1}^{n} \left( R_i - \frac{n}{2}\,\mathbf{1}_K \right) \left( R_i - \frac{n}{2}\,\mathbf{1}_K \right)' , \qquad (2)$$

with $R_i = (R_i^{(1)},\dots,R_i^{(K)})'$ and $\mathbf{1}_K$ the $K$-dimensional all-ones vector, and is the matrix whose $(k,k')$-th element is defined by

$$\hat{\Sigma}_n(k,k') = \frac{4}{n^3} \sum_{i=1}^{n} \left( R_i^{(k)} - \frac{n}{2} \right) \left( R_i^{(k')} - \frac{n}{2} \right) . \qquad (3)$$

Note that $\hat{\Sigma}_n(k,k')$ can be rewritten as

$$\hat{\Sigma}_n(k,k') = \frac{4}{n} \sum_{i=1}^{n} \left( \hat{F}_k\big(X_i^{(k)}\big) - \frac{1}{2} \right) \left( \hat{F}_{k'}\big(X_i^{(k')}\big) - \frac{1}{2} \right) , \qquad (4)$$

where $\hat{F}_k(x) = n^{-1} \sum_{j=1}^{n} \mathbb{1}\{X_j^{(k)} \le x\}$ denotes the empirical cumulative distribution function (c.d.f.) of the $k$th coordinate $X_1^{(k)},\dots,X_n^{(k)}$. The matrix $\hat{\Sigma}_n$ thus corresponds to an empirical estimate of the covariance matrix $\Sigma$ with general term

$$\Sigma(k,k') = 4\, \mathrm{Cov}\!\left( F_k\big(X_1^{(k)}\big),\, F_{k'}\big(X_1^{(k')}\big) \right) , \qquad (5)$$

$F_k$ denoting the c.d.f. of $X_1^{(k)}$, which we shall assume in the sequel to be continuous.

Observe that (1) extends to the multivariate setting the classical Kruskal-Wallis test [12], which applies to univariate data. Indeed, when $K = 1$, (1) is equivalent to

$$T_n(n_1,\dots,n_{L-1}) = \frac{12}{n^2} \sum_{l=1}^{L} (n_l - n_{l-1}) \left( \bar{R}_{n_{l-1}:n_l} \right)^2 , \qquad (6)$$

where we have replaced $\hat{\Sigma}_n$ by $4\,\mathrm{Var}(U) = 1/3$, where $U = F_1(X_1^{(1)})$ is a uniform random variable on $[0,1]$, $F_1$ being a continuous cumulative distribution function. In the case where there is only one change-point, i.e. when $L = 2$, (1) reduces to the test statistic proposed in [11] to extend the Mann-Whitney/Wilcoxon rank test to multivariate data. We state below (without proof, for reasons of space) a result which shows that (1) is properly normalized, in the sense that under the null hypothesis it converges, as $n$ increases, to a fixed limiting distribution that depends only on $K$ and $L$.

###### Theorem 1.

Assume that $X_1,\dots,X_n$ are $\mathbb{R}^K$-valued i.i.d. random vectors such that, for all $k$ in $\{1,\dots,K\}$, the c.d.f. $F_k$ of $X_1^{(k)}$ is a continuous function. Assume also that, for $l = 1,\dots,L-1$, there exists $t_l$ in $(0,1)$ such that $n_l/n \to t_l$ as $n$ tends to infinity. Then $T_n$ defined in (1) satisfies

$$T_n(n_1,\dots,n_{L-1}) \xrightarrow{\;d\;} \chi^2_{K(L-1)} , \qquad (7)$$

where $\xrightarrow{\;d\;}$ denotes convergence in distribution and $\chi^2_{K(L-1)}$ denotes the chi-square distribution with $K(L-1)$ degrees of freedom.
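To make the construction concrete, the statistic in (1) can be evaluated in a few lines. The sketch below is a minimal Python/NumPy implementation of our own, not the authors' code; the function name `multivariate_kw` and the exact centering and normalization constants (ranks centered by $n/2$, covariance scaled by $4/n^3$) are assumptions chosen to be consistent with the chi-square limit of Theorem 1.

```python
import numpy as np
from scipy.stats import rankdata


def multivariate_kw(X, boundaries):
    """Rank-based homogeneity statistic of eq. (1) (illustrative sketch).

    X          : (n, K) array of observations.
    boundaries : change-point indices 0 < n_1 < ... < n_{L-1} < n; the
                 groups are X[n_{l-1}:n_l] with n_0 = 0 and n_L = n.
    """
    n, _ = X.shape
    # Coordinate-wise ranks R_i^{(k)}, centered (assumed convention: n/2).
    R = np.apply_along_axis(rankdata, 0, X) - n / 2.0
    # Empirical rank covariance estimate: (4 / n^3) * R' R.
    sigma_inv = np.linalg.inv(4.0 / n**3 * (R.T @ R))
    bounds = [0, *boundaries, n]
    stat = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        r_bar = R[a:b].mean(axis=0)  # centered mean rank vector of the group
        stat += (b - a) * r_bar @ sigma_inv @ r_bar
    return 4.0 / n**2 * stat
```

Under the null hypothesis, Theorem 1 suggests comparing the returned value to a quantile of the $\chi^2_{K(L-1)}$ distribution.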

## 3 Detecting change-point locations

In this section, we consider applying the test statistic described in the previous section to determine the positions of the segment boundaries $n_1,\dots,n_{L-1}$, starting with the simpler case where the number of change-points is known.

### 3.1 Estimation of a known number of change-points

To estimate the change-point locations, we propose to maximize the statistic defined in (1) with respect to the segment boundaries $n_1,\dots,n_{L-1}$:

$$(\hat{n}_1,\dots,\hat{n}_{L-1}) = \operatorname*{argmax}_{0 < n_1 < \dots < n_{L-1} < n} T_n(n_1,\dots,n_{L-1}) . \qquad (8)$$

Of course, direct maximization of (8) is impossible as it corresponds to a combinatorial problem whose complexity is exponential in $L$. We can however perform this maximization efficiently using the dynamic programming algorithm described in [6]. More precisely, using the notations

$$\Delta_{i:j} = (j - i)\, \bar{R}_{i:j}'\, \hat{\Sigma}_n^{-1}\, \bar{R}_{i:j} \,, \qquad 0 \le i < j \le n,$$

so that $T_n(n_1,\dots,n_{L-1}) = (4/n^2) \sum_{l=1}^{L} \Delta_{n_{l-1}:n_l}$, and

$$I_L(j) = \max_{0 < n_1 < \dots < n_{L-1} < j} \; \sum_{l=1}^{L} \Delta_{n_{l-1}:n_l} \,, \qquad \text{with } n_L = j,$$

we have

$$I_L(j) = \max_{L-1 \le i < j} \left[ I_{L-1}(i) + \Delta_{i:j} \right] . \qquad (9)$$

Thus, for solving the optimization problem (8), we proceed as follows. We start by computing $\Delta_{i:j}$ for all $(i,j)$ such that $0 \le i < j \le n$. All the $I_1(j) = \Delta_{0:j}$ are thus available for $j = 1,\dots,n$. Then $I_2$ is computed by using the recursion (9), and so on up to $I_L$. The overall numerical complexity of the procedure is thus proportional to $L n^2$.
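The procedure above can be transcribed directly. The following sketch (Python/NumPy) is illustrative rather than the authors' implementation: the name `dyn_mkw`, the rank-centering convention, and the quadratic-time precomputation of the segment gains via rank prefix sums are our own choices.

```python
import numpy as np
from scipy.stats import rankdata


def dyn_mkw(X, n_cp):
    """Dynamic-programming maximization of (8) for a known number of
    change-points n_cp (illustrative sketch, O(L n^2) recursion)."""
    n, K = X.shape
    R = np.apply_along_axis(rankdata, 0, X) - n / 2.0   # centered ranks
    sigma_inv = np.linalg.inv(4.0 / n**3 * (R.T @ R))
    S = np.vstack([np.zeros(K), np.cumsum(R, axis=0)])  # rank prefix sums
    # delta[i, j] = gain of segment X[i:j] (the Delta term of the recursion).
    delta = np.full((n + 1, n + 1), -np.inf)
    for i in range(n):
        for j in range(i + 1, n + 1):
            r_bar = (S[j] - S[i]) / (j - i)
            delta[i, j] = (j - i) * r_bar @ sigma_inv @ r_bar
    L = n_cp + 1                       # number of segments
    I = delta[0].copy()                # I_1(j) = delta[0, j]
    back = np.zeros((L + 1, n + 1), dtype=int)
    for l in range(2, L + 1):          # fill I_l from I_{l-1}, eq. (9)
        I_new = np.full(n + 1, -np.inf)
        for j in range(l, n + 1):
            best, arg = -np.inf, l - 1
            for i in range(l - 1, j):
                v = I[i] + delta[i, j]
                if v > best:
                    best, arg = v, i
            I_new[j], back[l, j] = best, arg
        I = I_new
    cps, j = [], n                     # backtrack the optimal boundaries
    for l in range(L, 1, -1):
        j = back[l, j]
        cps.append(j)
    return sorted(cps)
```

Maximizing the sum of segment gains is equivalent to maximizing (1), since the two differ only by the constant factor $4/n^2$.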

### 3.2 Estimation of the number of change-points

In practice, the number of change-points can rarely be assumed to be known. Although it is not the main focus of this paper, we propose a heuristic algorithm to select the number of change-points. The values of the maximized statistic obtained with $\ell$ change-points, for $\ell$ in $\{1,\dots,\ell_{\max}\}$, are first computed using the procedure described in the previous section. The algorithm is based on the principle that, in the presence of $\ell^\star$ change-points, if the maximized statistic is plotted against $\ell$, the resulting graph can be decomposed into two distinct linear trends: a first one, for $\ell \le \ell^\star$, where the criterion grows rapidly; and a second one, for $\ell \ge \ell^\star$, where the criterion is barely increasing. Hence, for each possible value of $\ell$ in $\{2,\dots,\ell_{\max}-1\}$, we compute least-squares linear regressions for both parts of the graph (before and after $\ell$); the estimated number of change-points is the value of $\ell$ that yields the best fit, that is, the value for which the sum of the residual sums of squares computed on both parts of the graph is minimal (see Figure 1). The case $\ell = 0$ (no change-point) is treated separately, and the procedure that has just been described is used only when the value of the single change-point statistic is significant, based on the asymptotic $p$-value for the single change-point case obtained in [11].
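The two-regression fit can be sketched as follows (Python; `select_n_changepoints` is a hypothetical helper name, and its input is assumed to be the sequence of maximized statistics for $\ell = 1,\dots,\ell_{\max}$ change-points as produced by the dynamic program of Section 3.1).

```python
import numpy as np


def select_n_changepoints(crit):
    """Slope-heuristic selection sketch: crit[l - 1] is the maximized
    statistic with l change-points, for l = 1, ..., L_max.  Fits two
    least-squares lines (before/after each candidate kink l) and returns
    the l minimizing the total residual sum of squares."""
    L = np.arange(1, len(crit) + 1, dtype=float)
    y = np.asarray(crit, dtype=float)

    def rss(x, v):
        """Residual sum of squares of the best-fitting straight line."""
        if len(x) < 2:
            return 0.0
        coef = np.polyfit(x, v, 1)
        return float(np.sum((np.polyval(coef, x) - v) ** 2))

    best_l, best = None, np.inf
    for l in range(2, len(crit)):      # kink candidates with >= 2 points per side
        total = rss(L[:l], y[:l]) + rss(L[l - 1:], y[l - 1:])
        if total < best:
            best_l, best = l, total
    return best_l
```

For instance, a criterion curve rising steeply up to three change-points and nearly flat afterwards is split at three, where both regression lines fit with the smallest total residual.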

## 4 Numerical experiments

In this section, the performance of the *dynMKW* algorithm is
assessed using simulated signals and we next consider its application to data arising from a biological context.

### 4.1 Simulated data

The simulation reported in this section is based on a fixed 5-dimensional
piecewise constant signal of length 500 with a predefined set of four shared
breakpoints. Note that this signal reproduces an important feature of many
applications (see [3] as well as the example of the cancer
data below) in that only a subset of its coordinates actually change simultaneously.
To this baseline signal is added moderately correlated Gaussian
noise with marginal variance $\sigma^2$; see Figure 2
for an example corresponding to a marginal SNR of 16 dB, where the SNR is defined as the
ratio of the jump amplitude over $\sigma$.
The performance of the algorithms is assessed from 1000 Monte-Carlo replications of the data
for each value of $\sigma$ and measured in terms of *precision* (proportion of
correctly estimated changes found among the detected change-points) and
*recall* (proportion of actual change-points that have been correctly estimated).
We used a fixed sample tolerance for deciding whether a change-point has been
correctly estimated, but the conclusions were qualitatively the same for larger
values of this tolerance.
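For reference, the precision and recall measures can be computed as follows (a short sketch; the function name and the tolerance parameter are ours, since the exact tolerance value used in the experiments is not reproduced here).

```python
def precision_recall(true_cps, est_cps, tol):
    """Precision: fraction of estimated change-points lying within `tol`
    samples of some true change-point.  Recall: fraction of true
    change-points matched by at least one estimate."""
    matched_est = sum(any(abs(e - t) <= tol for t in true_cps) for e in est_cps)
    matched_true = sum(any(abs(e - t) <= tol for e in est_cps) for t in true_cps)
    precision = matched_est / len(est_cps) if est_cps else 0.0
    recall = matched_true / len(true_cps) if true_cps else 0.0
    return precision, recall
```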

#### Known number of change-points

We compare our algorithm with two other methods for multiple change-point detection – which also rely on a dynamic programming step: first, the kernel-based approach of [10], which computes an “intra-segment scatter” using an isotropic Gaussian kernel function; and second, the parametric method corresponding to the multivariate Gaussian assumption.

As shown in Figure 3 (left), the parametric method
(“Linear”) yields the best results, as the noise added to the signal is indeed Gaussian.
Unsurprisingly, the proposed *dynMKW* approach performs slightly worse in this scenario, but
it is certainly preferable to the kernel test, particularly when the level of noise increases.
In Figure 3 (right), we repeat the experiment with significant
contamination by outliers. The *dynMKW* method demonstrates its robustness as, contrary to both alternatives,
its performance barely suffers from the presence of outliers.

#### Unknown number of change-points

Using the same signal as previously, we suppose in the experiments presented here that the number of change-points is unknown.
Our algorithm, combined with the approach for estimating the number of
change-points presented in Section 3.2, is
compared with the greedy hierarchical method suggested by [13]:
breakpoints are detected by iteratively testing for the presence of a single
change-point in the sub-segments determined in previous stages. This method (“*Vost*”)
is applied using (8) restricted to $L = 2$ groups as a single change-point detector,
where the asymptotic $p$-value determined in [11] is computed to
decide whether the current segment should be segmented any further (segmentation
stops whenever the $p$-value is larger than a threshold, say 1 or 5%).
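For concreteness, this greedy strategy can be sketched as follows (Python). This is our own simplification: the exact asymptotic $p$-value of [11] is replaced by a plain $\chi^2_K$ tail approximation of the maximized two-group statistic, and the names `single_cp_stat`, `greedy_segmentation`, and the `min_len` guard are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2, rankdata


def single_cp_stat(X):
    """Two-group version of (1), maximized over the single boundary;
    returns (max statistic, argmax index)."""
    n, K = X.shape
    R = np.apply_along_axis(rankdata, 0, X) - n / 2.0
    sigma_inv = np.linalg.inv(4.0 / n**3 * (R.T @ R))
    S = np.vstack([np.zeros(K), np.cumsum(R, axis=0)])
    best, arg = -np.inf, 1
    for j in range(1, n):
        t = 0.0
        for a, b in ((0, j), (j, n)):
            r_bar = (S[b] - S[a]) / (b - a)
            t += (b - a) * r_bar @ sigma_inv @ r_bar
        t *= 4.0 / n**2
        if t > best:
            best, arg = t, j
    return best, arg


def greedy_segmentation(X, alpha=0.01, min_len=10):
    """Vostrikova-style bisection sketch: recursively split while the
    best single change-point is deemed significant, here via a
    chi-square(K) approximation rather than the exact p-value of [11]."""
    n, K = X.shape
    if n < 2 * min_len:
        return []
    stat, j = single_cp_stat(X)
    if chi2.sf(stat, df=K) > alpha:    # not significant: stop splitting
        return []
    left = greedy_segmentation(X[:j], alpha, min_len)
    right = greedy_segmentation(X[j:], alpha, min_len)
    return left + [j] + [j + c for c in right]
```

Because the maximization over the boundary inflates the statistic beyond its pointwise $\chi^2_K$ law, this crude calibration tends to over-segment, consistent with the behavior of *Vost* reported below.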

Results are shown in Figure 4 in terms of precision and
recall. It is observed that *dynMKW* outperforms
the *Vost* procedure both in precision and in recall.
From a more qualitative perspective, the *Vost* procedure
tends to systematically over-segment the signal, while our method, when it errs, is
more inclined to miss some change-points.

### 4.2 Bladder tumor CGH profiles

We consider the bladder cancer micro-array aCGH dataset and apply the *dynMKW* algorithm to each chromosome separately, thus treating 23 different 9- to 57-dimensional signals (depending on the selected groups of patients at different stages of cancer) of length 50 to 200 (the number of probes varies for each chromosome).
Results are shown for a group of 9 individuals corresponding to Stage T1 of a tumor.
In Figure 5, the smoothed version of the whole set of probes resulting from the segmentation is displayed, while a focus is given on the 10th chromosome in Figure 6. In each case, the segmentation result is represented by a signal which is constant (and equal to the mean of the data) within the detected segments.

### References

- B. E. Brodsky and B. S. Darkhovsky, Nonparametric Methods in Change-Point Problems, Kluwer Academic Publishers, 1993.
- M. Basseville and I. V. Nikiforov, Detection of Abrupt Changes: Theory and Applications, Prentice-Hall, 1993.
- C. Lévy-Leduc and F. Roueff, “Detection and localization of change-points in high-dimensional network traffic data,” Annals of Applied Statistics, vol. 3, no. 2, pp. 637–662, 2009.
- Y.C. Yao and S. T. Au, “Least-squares estimation of a step function,” Sankhya: The Indian Journal of Statistics, Series A, vol. 51, no. 3, pp. 370–381, 1989.
- M. Lavielle and E. Moulines, “Least-squares estimation of an unknown number of shifts in a time series,” Journal of Time Series Analysis, vol. 21, no. 1, pp. 33–59, 2000.
- S.M. Kay, Fundamentals of statistical signal processing: detection theory, Prentice-Hall, Inc., 1993.
- E. Lebarbier, “Detecting multiple change-points in the mean of a Gaussian process by model selection,” Signal Process., vol. 85, pp. 717–736, 2005.
- B. Jackson, J. Scargle, D. Barnes, A. Sundararajan, A. Alt, P. Gioumousis, E. Gwin, P. Sangtrakulcharoen, L. Tan, and T-T. Tsai, “An algorithm for optimal partitioning of data on an interval,” IEEE Signal Processing Letters, vol. 12, pp. 105–108, 2005.
- M. S. Srivastava and K. J. Worsley, “Likelihood ratio tests for a change in the multivariate normal mean,” J. Amer. Statist. Assoc., vol. 81, no. 393, pp. 199–204, 1986.
- Z. Harchaoui and O. Cappé, “Retrospective multiple change-point estimation with kernels,” in IEEE Workshop on Statistical Signal Processing, 2007.
- A. Lung-Yut-Fong, C. Lévy-Leduc, and O. Cappé, “Robust changepoint detection based on multivariate rank statistics,” in IEEE Int. Conf. Acoust., Speech, Signal Processing, 2011.
- W.H. Kruskal and W.A. Wallis, “Use of ranks in one-criterion variance analysis,” Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
- L. Yu. Vostrikova, “Detecting disorder in multidimensional random processes,” Soviet Math. Dokl., vol. 24, pp. 55–59, 1981.
- M. Lavielle, “Using penalized contrasts for the change-points problems,” Signal Process., vol. 85, no. 8, pp. 1501–1510, 2005.
- J.J. Ruanaidh and W.J. Fitzgerald, Numerical Bayesian Methods Applied to Signal Processing, Springer, 1996.
- P. Fearnhead, “Exact and efficient Bayesian inference for multiple changepoint problems,” Statistics and Computing, vol. 16, pp. 203–213, 2006.
- Z. Harchaoui and C. Lévy-Leduc, “Multiple change-point estimation with a total variation penalty,” J. Amer. Statist. Assoc., vol. 105, no. 492, pp. 1480–1493, 2010.
- J.P. Vert and K. Bleakley, “Fast detection of multiple change-points shared by many signals using group LARS,” in Advances in Neural Information Processing Systems 23, 2010.