Robust Retrospective Multiple Change-point Estimation for Multivariate Data
We propose a non-parametric statistical procedure for detecting multiple change-points in multidimensional signals. The method is based on a test statistic that generalizes the well-known Kruskal-Wallis procedure to the multivariate setting. The proposed approach does not require any knowledge about the distribution of the observations and is parameter-free. It is computationally efficient thanks to the use of dynamic programming and can also be applied when the number of change-points is unknown. The method is shown through simulations to be more robust than alternatives, particularly when faced with atypical distributions (e.g., with outliers), high noise levels and/or high-dimensional data.
Keywords: Change-point estimation, multivariate data, Kruskal-Wallis test, robust statistics, joint segmentation.
Retrospective multiple change-point estimation consists in partitioning a non-stationary series of observations into several contiguous stationary segments of variable durations [1]. It is particularly appropriate for analyzing a posteriori time series in which the quantity driving the behavior of the time series jumps from one level to another at random instants called change-points. Such a task, also known as temporal signal segmentation in signal processing, arises in many applications, ranging from EEG to speech processing and network intrusion detection [2, 3].
The classic setting for change-point detection and estimation is the case where the sequence of observations is univariate and the use of the least-squares criterion is considered to be adequate, for instance, in the presence of Gaussian noise [4, 5]. When there are multiple change-points to be detected, the composite least-squares criterion can be efficiently optimized using a dynamic programming recursion whose computational cost scales like the square of the number of observations [6, 7, 8]. The approach can be generalized to multivariate settings [9].
In this work, we consider methods that can be applied to high-dimensional data without requiring strong prior knowledge of the underlying distribution of the observations. To the best of our knowledge, there are very few approaches in the literature that are suitable for this task. The first option is the kernel-based approach of [10], which uses a general kernel metric to evaluate distances between multivariate observations. This approach is powerful but requires the choice of an appropriate kernel function and, furthermore, it is not robust in situations where the noise level is very significant, as will be illustrated in Section 4. The other relevant reference is [11], which combines univariate rank statistics using a weighting matrix derived from an asymptotic analysis. This approach, which is based on the principle of the Mann-Whitney/Wilcoxon two-sample test, is however limited to the case where, at most, a single change-point is present in the observed signal. Our contribution with this paper is to show that the test statistic of [11] can be modified so as to deal with multiple change-points (in analogy with the way the classical Kruskal-Wallis test generalizes the Mann-Whitney/Wilcoxon statistic to multiple group testing [12]). Furthermore, we show that the proposed test statistic is amenable to dynamic programming and can thus be optimized efficiently despite the combinatorial nature of the multiple change-point detection task. We will show in particular that the proposed method, termed dynMKW (for “dynamic programming multivariate Kruskal-Wallis”), is more efficient than greedy strategies based on repeated use of single change-point tests as suggested in [13].
When dealing with multiple change-points, one needs a criterion for estimating the number of change-points, which is an instance of a model selection problem. Traditional approaches include the use of a complexity penalty that is added to the test statistic [5, 7, 14] or the use of priors on the locations of change-points when adopting the Bayesian point of view, see, e.g., [15, 16] and references therein. An original approach adopted in [17, 18] is to consider the penalty associated with the use of the $\ell_1$ norm (or block-$\ell_1$ norm in [18]). In this paper, we propose a simple construction which relies on the so-called “slope heuristic” and is a variant of the procedure described in [14]. Although heuristic, this approach is preferable to off-the-shelf Schwarz-like penalties (AIC, BIC, etc.), which are most often found to be badly calibrated for the type of problems considered here.
The paper is organized as follows. The statistical methodology for testing homogeneity in several groups of multivariate observations is described in Section 2. The procedure for estimating the change-points derived from this testing procedure is then described in Section 3. In Section 4, we report the results of numerical experiments carried out on synthetic and real data.
2 Testing homogeneity within several groups of multivariate data
Let $X_1,\dots,X_n$ be $K$-dimensional observed random vectors such that $X_i = (X_i^{(1)},\dots,X_i^{(K)})'$, where $M'$ denotes the transpose of the matrix $M$. We assume in the following that $X_1,\dots,X_n$ are independent. In this section, we first consider testing the hypothesis that $L+1$ given groups, $\{X_{n_0+1},\dots,X_{n_1}\}$, $\{X_{n_1+1},\dots,X_{n_2}\}$, …, $\{X_{n_L+1},\dots,X_{n_{L+1}}\}$, have the same distribution, where we will use the convention that $n_0 = 0$ and $n_{L+1} = n$.
For $j$ in $\{1,\dots,K\}$ and $i$ in $\{1,\dots,n\}$, let $R_i^{(j)}$ denote the rank of $X_i^{(j)}$ among $X_1^{(j)},\dots,X_n^{(j)}$, that is, $R_i^{(j)} = \sum_{i'=1}^{n} \mathbb{1}\{X_{i'}^{(j)} \leq X_i^{(j)}\}$. Also define $\bar{R}_l = (\bar{R}_l^{(1)},\dots,\bar{R}_l^{(K)})'$, for $l$ in $\{1,\dots,L+1\}$, by
$$\bar{R}_l^{(j)} = \frac{1}{n_l - n_{l-1}} \sum_{i=n_{l-1}+1}^{n_l} \left( R_i^{(j)} - \frac{n+1}{2} \right)\;.$$
We propose to use the following statistic:
$$S_n(n_1,\dots,n_L) = \frac{4}{n^2} \sum_{l=1}^{L+1} (n_l - n_{l-1})\, \bar{R}_l'\, \hat{\Sigma}_n^{-1}\, \bar{R}_l \;, \qquad (1)$$
where $\bar{R}_l'$ denotes the transpose of the vector $\bar{R}_l$ and $\hat{\Sigma}_n$ is the $K \times K$ matrix whose $(j,k)$-th element is defined by
$$\hat{\Sigma}_n(j,k) = \frac{4}{n^3} \sum_{i=1}^{n} \left( R_i^{(j)} - \frac{n+1}{2} \right) \left( R_i^{(k)} - \frac{n+1}{2} \right)\;.$$
Note that $\hat{\Sigma}_n(j,k)$ can be rewritten as
$$\hat{\Sigma}_n(j,k) = \frac{4}{n} \sum_{i=1}^{n} \left( \hat{F}_j(X_i^{(j)}) - \frac{n+1}{2n} \right) \left( \hat{F}_k(X_i^{(k)}) - \frac{n+1}{2n} \right)\;,$$
where $\hat{F}_j$ denotes the empirical cumulative distribution function (c.d.f.) of the $j$th coordinate, $\hat{F}_j(x) = n^{-1} \sum_{i=1}^{n} \mathbb{1}\{X_i^{(j)} \leq x\}$. The matrix $\hat{\Sigma}_n$ thus corresponds to an empirical estimate of the covariance matrix $\Sigma$ with general term
$$\Sigma(j,k) = 4\,\mathrm{Cov}\left( F_j(X_1^{(j)}),\, F_k(X_1^{(k)}) \right)\;,$$
$F_j$ denoting the c.d.f. of $X_1^{(j)}$, which we shall assume in the sequel to be continuous.
In particular, the diagonal terms satisfy $\Sigma(j,j) = 4\,\mathrm{Var}(U) = 1/3$, where we have replaced $F_j(X_1^{(j)})$ by $U$, $U$ being a uniform random variable on $[0,1]$, $F_j$ being a continuous cumulative distribution function. In the case where there is only one change-point, i.e. when $L = 1$, (1) reduces to the test statistic proposed in [11] to extend the Mann-Whitney/Wilcoxon rank test to multivariate data. We state below (without proof for reasons of space) a result which shows that (1) is properly normalized in the sense that, under the null hypothesis, it converges, as $n$ increases, to a fixed limiting distribution that only depends on $K$ and $L$.
Assume that $X_1,\dots,X_n$ are $\mathbb{R}^K$-valued i.i.d. random vectors such that, for all $j$ in $\{1,\dots,K\}$, the c.d.f. $F_j$ of $X_1^{(j)}$ is a continuous function. Assume also that, for all $l$ in $\{1,\dots,L\}$, there exists $t_l$ in $(0,1)$ such that $n_l/n \to t_l$, as $n$ tends to infinity. Then, $S_n(n_1,\dots,n_L)$ defined in (1) satisfies
$$S_n(n_1,\dots,n_L) \xrightarrow{d} \chi^2(LK)\;,$$
where $\xrightarrow{d}$ denotes the convergence in distribution and $\chi^2(LK)$ denotes the chi-square distribution with $LK$ degrees of freedom.
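The rank statistic described in this section (marginal ranks, centred per-group mean ranks, and an empirical rank covariance matrix) can be sketched in a few lines of NumPy. The following is an illustrative implementation under our own conventions (the function name, argument layout, and normalization are ours, not reference code), assuming continuous marginals so that no ties occur among observations:

```python
import numpy as np

def mkw_statistic(X, bounds):
    # X: (n, K) array of observations; bounds: candidate change-points
    # 0 < n_1 < ... < n_L < n. Assumes continuous marginals (no ties).
    n, K = X.shape
    R = np.argsort(np.argsort(X, axis=0), axis=0) + 1.0  # marginal ranks in 1..n
    U = R - (n + 1) / 2.0                                # centred ranks
    Sigma = 4.0 * (U.T @ U) / n ** 3                     # empirical rank covariance
    Sigma_inv = np.linalg.inv(Sigma)
    edges = [0] + list(bounds) + [n]
    S = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        rbar = U[a:b].mean(axis=0)                       # centred mean rank per group
        S += (b - a) * (rbar @ Sigma_inv @ rbar)
    return 4.0 * S / n ** 2
```

In the univariate case this normalization agrees, up to asymptotically negligible terms, with the classical Kruskal-Wallis statistic.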
3 Detecting change-point locations
In this section, we consider applying the test statistic described in the previous section to determine the positions of the segment boundaries $n_1,\dots,n_L$, starting with the simpler case where the number of change-points is known.
3.1 Estimation of a known number of change-points
To estimate the change-point locations, we propose to maximize the statistic defined in (1) with respect to the segment boundaries $(n_1,\dots,n_L)$:
$$(\hat{n}_1,\dots,\hat{n}_L) = \underset{0 < n_1 < \dots < n_L < n}{\arg\max}\; S_n(n_1,\dots,n_L)\;. \qquad (8)$$
Of course, direct maximization of (8) is impossible as it corresponds to a combinatorial problem whose complexity is exponential in $L$. We can however perform this maximization efficiently using the dynamic programming algorithm described in [8]. More precisely, using the notations
$$\Delta(i{:}j) = \frac{4}{n^2}\,(j - i + 1)\, \bar{R}_{i:j}'\, \hat{\Sigma}_n^{-1}\, \bar{R}_{i:j} \qquad \text{and} \qquad I_l(m) = \max_{0 < n_1 < \dots < n_l < m} \sum_{k=1}^{l+1} \Delta(n_{k-1}{+}1{:}n_k)\;,$$
where $\bar{R}_{i:j}$ denotes the vector of centred average ranks of the observations $X_i,\dots,X_j$ and the convention $n_{l+1} = m$ is used, the optimum obeys the recursion
$$I_l(m) = \max_{l \leq k < m} \left\{ I_{l-1}(k) + \Delta(k{+}1{:}m) \right\}\;. \qquad (9)$$
Thus, for solving the optimization problem (8), we proceed as follows. We start by computing the $\Delta(i{:}j)$ for all $(i,j)$ such that $1 \leq i \leq j \leq n$. All the $I_1(m)$ are thus available for $1 < m \leq n$. Then $I_2$ is computed by using the recursion (9), and so on. The overall numerical complexity of the procedure is thus proportional to $L n^2$.
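The dynamic programming step operates on any segment score that is additive over segments. The sketch below (array layout and names are ours; a generic precomputed score table `delta` stands in for the rank-based segment scores) illustrates the recursion and the backtracking of the optimal boundaries:

```python
import numpy as np

def best_segmentation(delta, L):
    # delta[i, j]: score of the segment {i, ..., j} (0-indexed, inclusive).
    # Returns the best total score and the L change-points, each reported
    # as the start index of a new segment.
    n = delta.shape[0]
    I = np.full((L + 1, n), -np.inf)   # I[l, m]: best score of 0..m with l changes
    back = np.zeros((L + 1, n), dtype=int)
    I[0] = delta[0]                    # a single segment 0..m
    for l in range(1, L + 1):
        for m in range(l, n):
            # last segment is k+1..m, previous part is 0..k with l-1 changes
            cand = I[l - 1, l - 1:m] + delta[l:m + 1, m]
            j = int(np.argmax(cand))
            I[l, m] = cand[j]
            back[l, m] = j + l - 1
    cps, m = [], n - 1
    for l in range(L, 0, -1):          # backtrack the optimal boundaries
        k = int(back[l, m])
        cps.append(k + 1)
        m = k
    return float(I[L, n - 1]), sorted(cps)
```

Since the inner maximization runs over at most $n$ positions for each of the $L n$ table entries, the cost is indeed proportional to $L n^2$ once the segment scores are available.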
3.2 Estimation of the number of change-points
In practice the number of change-points can rarely be assumed to be known. Although it is not the main focus of this paper, we propose a heuristic algorithm to find the optimal number of change-points. Values of the optimized statistic are first computed for each number of change-points $L$ in $\{1,\dots,L_{\max}\}$, using the procedure described in the previous section. The algorithm is based on the principle that, in the presence of $L^*$ change-points, if the optimized statistic is plotted against $L$, the resulting graph can be decomposed into two distinct linear trends: a first one, for $L \leq L^*$, where the criterion is growing rapidly; and a second one, for $L \geq L^*$, where the criterion is barely increasing. Hence, for each possible value of $L^*$ in $\{1,\dots,L_{\max}\}$, we compute least-squares linear regressions for both parts of the graph (before and after $L^*$); the estimated number of change-points is the value of $L^*$ that yields the best fit, that is, the value for which the sum of the residual sums of squares computed on both parts of the graph is minimal (see Figure 1). The case $L^* = 0$ is treated separately: the procedure that has just been described is used only when the value of the single change-point statistic is significant, based on the asymptotic $p$-value for the single change-point case obtained in [11].
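The two-regression fit can be sketched as follows. This is a hedged illustration (names are ours, and the separate treatment of the no-change case via the single change-point $p$-value is omitted); the candidate kink point is shared by both regressions:

```python
import numpy as np

def estimate_num_changes(J, Lmax):
    # J[L-1]: value of the optimized criterion with L change-points, L = 1..Lmax.
    # Fit two regression lines (before/after a candidate L*) and keep the
    # candidate minimizing the total residual sum of squares.
    J = np.asarray(J, dtype=float)
    L = np.arange(1, Lmax + 1, dtype=float)

    def rss(x, y):
        coef = np.polyfit(x, y, 1)                       # degree-1 least squares
        return float(((np.polyval(coef, x) - y) ** 2).sum())

    best, best_L = np.inf, 1
    for split in range(2, Lmax):                         # candidate L*
        total = rss(L[:split], J[:split]) + rss(L[split - 1:], J[split - 1:])
        if total < best:
            best, best_L = total, split
    return best_L
```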
4 Numerical experiments
In this section, the performance of the dynMKW algorithm is first assessed using simulated signals; we next consider its application to data arising from a biological context.
4.1 Simulated data
The simulation reported in this section is based on a fixed 5-dimensional piecewise constant signal of length 500 with a predefined set of four shared breakpoints. Note that this signal reproduces an important feature of many applications (see [18] as well as the example of the cancer data below) in that only a subset of its coordinates actually change simultaneously. To this baseline signal is added moderately correlated Gaussian noise with marginal variance $\sigma^2$; see Figure 2 for an example corresponding to a marginal SNR of 16 dB, where the SNR is defined as the ratio of the jump amplitude over $\sigma$. The performance of the algorithms is assessed from 1000 Monte-Carlo replications of the data for each value of $\sigma$ and measured in terms of precision (proportion of correctly estimated changes found among the detected change-points) and recall (proportion of actual change-points that have been correctly estimated). We used a sample tolerance for deciding whether a change-point has been correctly estimated, but the conclusions were qualitatively the same for larger values of this tolerance.
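For reference, the precision and recall measures with a sample tolerance can be computed as in the following sketch (a hypothetical helper reflecting our reading of the matching rule; names are ours):

```python
def precision_recall(true_cps, est_cps, tol):
    # A detected change-point is "correct" if some true change-point
    # lies within `tol` samples of it, and conversely for recall.
    hit_est = sum(1 for e in est_cps if any(abs(e - t) <= tol for t in true_cps))
    hit_true = sum(1 for t in true_cps if any(abs(e - t) <= tol for e in est_cps))
    precision = hit_est / len(est_cps) if est_cps else 0.0
    recall = hit_true / len(true_cps) if true_cps else 0.0
    return precision, recall
```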
Known number of change-points
We compare our algorithm with two other methods – which also rely on a dynamic programming step – for multiple change-point detection: first, the kernel-based approach of [10], which computes the “intra-segment scatter” using an isotropic Gaussian kernel function, and second, the parametric method corresponding to the multivariate Gaussian assumption.
As shown in Figure 3 (left), the parametric method (“Linear”) yields the best results, as the noise added to the signal is indeed Gaussian. Unsurprisingly, the proposed dynMKW approach performs slightly worse in this scenario but is certainly preferable to the kernel test, particularly when the level of noise increases. In Figure 3 (right), we repeat the experiment with significant contamination by outliers. The dynMKW method demonstrates its robustness as, contrary to both alternatives, its performance barely suffers from the presence of outliers.
Unknown number of change-points
Using the same signal as previously, we suppose in the experiments presented here that the number of change-points is unknown. Our algorithm, combined with the approach for estimating the number of change-points presented in Section 3.2, is compared with a greedy hierarchical method suggested by [13]: breakpoints are detected by iteratively testing for the presence of a single change-point in the sub-segments determined in previous stages. This method (“Vost”) is applied using (8) restricted to $L = 1$ (two groups) as a single change-point detector, where the asymptotic $p$-value determined in [11] is computed to decide whether the current segment should be segmented any further (segmentation stops whenever the $p$-value is larger than a threshold, say 1 or 5%).
Results are shown in Figure 4 with precision and recall measures. It is observed that dynMKW outperforms the Vost procedure both in precision and recall. From a more qualitative perspective, the Vost procedure tends to systematically over-segment the signal while our method, when it errs, is more inclined to miss some change-points.
4.2 Bladder tumor CGH profiles
We consider the bladder cancer micro-array aCGH dataset
[1] B. E. Brodsky and B. S. Darkhovsky, Nonparametric Methods in Change-Point Problems, Kluwer Academic Publishers, 1993.
[2] M. Basseville and I. V. Nikiforov, Detection of Abrupt Changes: Theory and Application, Prentice-Hall, 1993.
[3] C. Lévy-Leduc and F. Roueff, “Detection and localization of change-points in high-dimensional network traffic data,” Annals of Applied Statistics, vol. 3, no. 2, pp. 637–662, 2009.
[4] Y. C. Yao and S. T. Au, “Least-squares estimation of a step function,” Sankhyā: The Indian Journal of Statistics, Series A, vol. 51, no. 3, pp. 370–381, 1989.
[5] M. Lavielle and E. Moulines, “Least-squares estimation of an unknown number of shifts in a time series,” Journal of Time Series Analysis, vol. 21, no. 1, pp. 33–59, 2000.
[6] S. M. Kay, Fundamentals of Statistical Signal Processing: Detection Theory, Prentice-Hall, 1993.
[7] E. Lebarbier, “Detecting multiple change-points in the mean of a Gaussian process by model selection,” Signal Processing, vol. 85, pp. 717–736, 2005.
[8] B. Jackson, J. Scargle, D. Barnes, S. Arabhi, A. Alt, P. Gioumousis, E. Gwin, P. Sangtrakulcharoen, L. Tan, and T.-T. Tsai, “An algorithm for optimal partitioning of data on an interval,” IEEE Signal Processing Letters, vol. 12, pp. 105–108, 2005.
[9] M. S. Srivastava and K. J. Worsley, “Likelihood ratio tests for a change in the multivariate normal mean,” Journal of the American Statistical Association, vol. 81, no. 393, pp. 199–204, 1986.
[10] Z. Harchaoui and O. Cappé, “Retrospective multiple change-point estimation with kernels,” in IEEE Workshop on Statistical Signal Processing, 2007.
[11] A. Lung-Yut-Fong, C. Lévy-Leduc, and O. Cappé, “Robust changepoint detection based on multivariate rank statistics,” in IEEE Int. Conf. Acoust., Speech, Signal Processing, 2011 (to appear).
[12] W. H. Kruskal and W. A. Wallis, “Use of ranks in one-criterion variance analysis,” Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[13] L. Yu. Vostrikova, “Detecting disorder in multidimensional random processes,” Soviet Math. Dokl., vol. 24, pp. 55–59, 1981.
[14] M. Lavielle, “Using penalized contrasts for the change-point problem,” Signal Processing, vol. 85, no. 8, pp. 1501–1510, 2005.
[15] J. J. K. Ó Ruanaidh and W. J. Fitzgerald, Numerical Bayesian Methods Applied to Signal Processing, Springer, 1996.
[16] P. Fearnhead, “Exact and efficient Bayesian inference for multiple changepoint problems,” Statistics and Computing, vol. 16, pp. 203–213, 2006.
[17] Z. Harchaoui and C. Lévy-Leduc, “Multiple change-point estimation with a total variation penalty,” Journal of the American Statistical Association, vol. 105, no. 492, pp. 1480–1493, 2010.
[18] J.-P. Vert and K. Bleakley, “Fast detection of multiple change-points shared by many signals using group LARS,” in Advances in Neural Information Processing Systems 23, 2010.