Novel semimetrics for multivariate change point analysis and anomaly detection
Abstract
This paper proposes a new method for determining similarity and anomalies between time series, most practically effective in large collections of (likely related) time series, by measuring distances between structural breaks within such a collection. We introduce a class of semimetric distance measures, which we term MJ distances. These semimetrics provide an advantage over existing options such as the Hausdorff and Wasserstein metrics. We prove they have desirable properties, including better sensitivity to outliers, while experiments on simulated data demonstrate that they uncover similarity within collections of time series more effectively. Semimetrics carry a potential disadvantage: without the triangle inequality, they may not satisfy a “transitivity property of closeness.” We analyse this failure with proof and introduce a computational method to investigate it, with which we demonstrate that our semimetrics violate transitivity infrequently and mildly. Finally, we apply our methods to cryptocurrency and measles data, introducing a judicious application of eigenvalue analysis.
keywords:
semimetrics, changepoint detection, multivariate analysis, time series, anomaly detection
1 Introduction
Similarity and anomaly detection are widely researched problems in statistics and the natural sciences. Change point detection is an important task in time series analysis, and more broadly, within anomaly detection. Developed by Hawkins et al. Hawkins1977; Hawkins2003, this task requires one to estimate the location of changes in statistical properties, that is, points in time at which the estimated probability density functions change. In the more statistical literature, focussed on time series data, statisticians such as Ross Ross2011T; Ross2011 have developed change point models driven by hypothesis tests, where $p$-values govern statistical decision making. The change point models applied in this paper follow Ross RossCPM.
Various change point algorithms test for shifts in different underlying distributional properties and generally make strong assumptions regarding the statistical properties of random variables. In this paper, we make use of the Mann-Whitney test, described in Pettitt1979, which detects changes in the mean, and the Kolmogorov-Smirnov test, described in Ross2012, which detects more general distributional changes. These algorithms make the strong assumption of independence; for applicability of the change point algorithms on data with dependence, see transformations due to Gustafsson Gustafsson2000.
Metric spaces appear throughout mathematics. One particular field of study that has arisen in image detection and other applications is the study of metrics on the power set of (certain) subsets of $\mathbb{R}^n$. The most utilized metric in this context is the Hausdorff metric, which we introduce and summarize in section 2. This provides a metric between closed and bounded subsets of any ambient metric space $X$. In addition to the Hausdorff metric, there are several semimetrics, satisfying some but not all of the properties of a metric, which are still useful. These properties will also be summarized in section 2.
Conci and Kubrusly Conci2017 give an overview of certain such (semi)metrics on the space of subsets and some applications. This review breaks down the applications of such distances between subsets into three primary areas: computational aspects Eiter1997; Atallah1983; Atallah1991; Barton2010; Shonkwiler1989; Huttenlocher1990; Huttenlocher1992; Aspert2002, distances between fuzzy sets Brass2002; Fujita2013; Gardner2014; Rosenfeld1985; Chaudhuri1996; Chaudhuri1999; Boxer1997; Fan1998 and distances in image analysis Fujita2013; Gardner2014; Dubuisson1994; Rote1991; Li2008; Huttenlocher1990; Huttenlocher1992; Rucklidge1995; Rucklidge1996; Rucklidge1997. The sensitivity of the Hausdorff metric to outliers has been noted by Baddeley1992, and the metric has proven largely unsuitable for algorithmic problems pertaining to image analysis. In section 5, we present similar findings when using the Hausdorff distance between finite sets of change points.
There has been extensive work in determining similarity between time series. Moeckel and Murray Moeckel1997 survey the shortcomings of possible distance functions between time series, stating the Hausdorff metric’s limitation in ignoring the frequency with which one set visits parts of the comparable set. That is, the metric only focuses on one measurement between two candidate sets, and is sensitive to outliers. Instead, they propose a distance function developed by Kantorovich Kantorovich2006 that is based on geometric and probabilistic factors; this proves to be robust with respect to outliers, noise and discretisation errors. Moeckel and Murray also demonstrate that the transportation distance proposed by Kantorovich can be used to evaluate mathematical models for chaotic or stochastic systems, and for parameter estimation within a dynamical systems context.
To our knowledge, there has been no work in detecting the similarity between time series’ change points. Change points signify changes in the statistical properties of a time series’ distribution, so determining which time series are most similar with respect to the number and location of these changes is of interest to analysts, in a wide variety of disciplines, who wish to assess the underlying structure in larger collections of time series. Equally of interest are time series whose change points are dissimilar to the rest of the collection and exhibit anomalous behaviour with respect to their structural breaks.
The contributions of this paper are as follows. First, we introduce a new family of semimetrics and provide analysis of their properties, and new analysis of existing semimetrics. Next, we apply such semimetrics and metrics to measure distance between time series based on their structural breaks, forming a new distance matrix between time series. We introduce a computational method for analysing the transitivity properties of a semimetric, and perform it in this setting. Finally, we introduce a simple but pithy method of eigenvalue analysis in order to determine the size of a majority cluster in our setting. In circumstances when one expects, a priori, a large majority of time series to behave very similarly, with some anomalies, our eigenvalue analysis quickly approximates the size of a majority cluster and provides an understanding of the total scale of the distance matrix.
The rest of the paper is organised as follows. Section 2 provides a review of existing (semi)metrics. In section 3, we propose a new family of semimetrics, analyse their desirable properties, and prove propositions on both the new and existing semimetrics. Section 4 describes our computational methodology. Section 5 conducts simulations in three scenarios, analysing the robustness of the metrics and semimetrics in the presence of outliers. Section 6 applies our analysis to the cryptocurrency market and 19th century UK measles data, using our eigenvalue analysis as well as hierarchical and spectral clustering. Section 7 concludes the paper. In Appendices A and B respectively, we provide a description of the change point algorithm used, and include all remaining proofs.
2 Review and analysis of existing (semi)metrics
In this section, we review some properties of a metric space and the existing Hausdorff, modified Hausdorff and Wasserstein (semi)metrics. Most significantly, we describe exactly how the Wasserstein metric is applied between finite sets, which does not appear clearly in the literature.
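A metric space is a pair $(X, d)$ where $X$ is a set and $d: X \times X \to \mathbb{R}$ satisfies the following axioms for all $x, y, z \in X$:

1. $d(x, y) \geq 0$, with equality if and only if $x = y$.

2. $d(x, y) = d(y, x)$.

3. $d(x, z) \leq d(x, y) + d(y, z)$.

A semimetric satisfies 1 and 2, but not necessarily 3, which is known as the triangle inequality.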
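Given a subset $S \subseteq X$ and a point $x \in X$, the distance from the point to the set is defined as the minimal distance from $x$ to $S$, given by:
(1) $d(x, S) = \inf_{s \in S} d(x, s)$.
Here $d(x, S) \geq 0$, with equality if and only if $x$ lies in the closure of $S$. Moreover, $d(\cdot, S)$ is continuous. Now let $S, T \subseteq X$. A common notion of distance between these subsets is the minimal distance between them, given by:
(2) $d_{\min}(S, T) = \inf_{s \in S, t \in T} d(s, t)$.
Note $d_{\min}(S, T) = 0$ if $S, T$ intersect. In fact, for bounded subsets of $\mathbb{R}^n$, $d_{\min}(S, T) = 0$ if and only if their closures intersect. So this is not an effective metric between subsets.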
Definition 2.1 (Hausdorff distance).
The Hausdorff distance considers how separated $S$ and $T$ are at most, rather than at least. It is defined by:
$d_H(S, T) = \max\left( \sup_{s \in S} d(s, T), \sup_{t \in T} d(t, S) \right)$.
This is the supremum or $L^\infty$ norm of all minimal distances from points $s \in S$ to $T$ and points $t \in T$ to $S$. The Hausdorff distance satisfies the triangle inequality, but this supremum is highly sensitive to even a single outlier. We propose using the $L^p$ norms instead. Henceforth, $S$ and $T$ will be finite sets. Eventually, $S, T$ will be sets of structural breaks of time series. We present three modified Hausdorff distances below.
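The point-to-set distance, the minimal distance between sets and the Hausdorff distance can be sketched for finite sets of real numbers as follows. This is a minimal illustration, assuming change points are plain numbers and the ambient metric is the absolute difference; the function names and the two example sets are hypothetical and do not come from the paper.

```python
def point_to_set(x, S):
    """d(x, S): smallest distance from the point x to the finite set S."""
    return min(abs(x - s) for s in S)

def min_distance(S, T):
    """Minimal distance between S and T; zero as soon as they share a point."""
    return min(point_to_set(s, T) for s in S)

def hausdorff(S, T):
    """Largest point-to-set distance, taken in both directions."""
    return max(max(point_to_set(s, T) for s in S),
               max(point_to_set(t, S) for t in T))

S = [10, 50, 90]          # hypothetical change points of one series
T = [12, 50, 88, 400]     # a second series with one outlying change point
print(min_distance(S, T))  # 0, because 50 lies in both sets
print(hausdorff(S, T))     # 310, driven entirely by the outlier 400
```

The example shows both failure modes discussed above: the minimal distance ignores everything except the closest pair, while the Hausdorff distance is determined by a single outlier.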
Definition 2.2 (Modified Hausdorff distance 1).
The first modified Hausdorff distance MH$_1$ is defined by
$d^1_{MH}(S, T) = \max\left( \frac{1}{|S|} \sum_{s \in S} d(s, T), \frac{1}{|T|} \sum_{t \in T} d(t, S) \right)$.
It is presented in Deza Deza2009 and Dubuisson Dubuisson1994. As is the case with most modified Hausdorff metrics, the primary application to date has been in computer vision tasks, where semimetrics and metrics focused on geometric averaging provide a more robust distance measure in comparison to the Hausdorff distance.
Definition 2.3 (Modified Hausdorff distance 2).
The second modified Hausdorff distance MH$_2$ is defined by
$d^2_{MH}(S, T) = \max\left( \sum_{s \in S} d(s, T), \sum_{t \in T} d(t, S) \right)$.
Eiter Eiter1997 and Dubuisson Dubuisson1994 present this distance measure, which captures the total distance between one set and another. Removing the $\max$ operator from outside the bracket yields what is essentially a measure of total deviation between all points of two sets.
Definition 2.4 (Modified Hausdorff distance 3).
The third modified Hausdorff distance MH$_3$ is defined by
$d^3_{MH}(S, T) = \frac{1}{|S| + |T|} \left( \sum_{s \in S} d(s, T) + \sum_{t \in T} d(t, S) \right)$.
Deza Deza2009 and Dubuisson Dubuisson1994 propose this as a variant of $d^1_{MH}$ with a different averaging component. This measure is referred to as geometric mean error between two images.
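The three modified Hausdorff distances can be compared on a small example. The sketch below assumes the forms suggested by the descriptions above: a maximum of the two directed averages (first variant), a maximum of the two directed sums (second variant) and a pooled average over both sets (third variant). The function names `mh1`, `mh2`, `mh3` and the example sets are hypothetical.

```python
def point_to_set(x, S):
    """d(x, S) for a finite set S of real numbers."""
    return min(abs(x - s) for s in S)

def directed_sum(S, T):
    """Sum over s in S of the minimal distance d(s, T)."""
    return sum(point_to_set(s, T) for s in S)

def mh1(S, T):
    """Assumed first variant: larger of the two directed averages."""
    return max(directed_sum(S, T) / len(S), directed_sum(T, S) / len(T))

def mh2(S, T):
    """Assumed second variant: larger of the two directed sums."""
    return max(directed_sum(S, T), directed_sum(T, S))

def mh3(S, T):
    """Assumed third variant: all minimal distances pooled and averaged."""
    return (directed_sum(S, T) + directed_sum(T, S)) / (len(S) + len(T))

S = [10, 50, 90]
T = [12, 50, 88, 400]   # 400 acts as an outlier
print(mh1(S, T))  # 78.5
print(mh2(S, T))  # 314
print(mh3(S, T))  # roughly 45.43
```

On this example the Hausdorff distance would be 310, so the averaged variants damp the outlier while the summed variant does not.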
Definition 2.5 (Wasserstein distance).
The Wasserstein metric DelBarrio is commonly used as a distance between two probability measures. Intuitively, it gives the work (in the sense of physics) required to mould one probability measure into another. Given probability measures $\mu, \nu$ on a metric space $(X, d)$, define
$W_p(\mu, \nu) = \inf_{\gamma} \left( \int_{X \times X} d(x, y)^p \, d\gamma(x, y) \right)^{1/p}$.
This infimum is taken over all joint probability measures $\gamma$ on $X \times X$ with marginal probability measures $\mu$ and $\nu$. Now let $S, T$ be finite sets. Associate to each set a probability measure defined as a weighted sum of Dirac delta measures
(3) $\mu_S = \frac{1}{|S|} \sum_{s \in S} \delta_s$.
The Wasserstein distance is defined as $d^p_W(S, T) = W_p(\mu_S, \mu_T)$. In subsequent experiments, when using the Wasserstein metric, we set $p = 1$.
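For finite sets of real numbers, the Wasserstein construction above can be sketched without any optimisation library. The example assumes that each point carries equal weight $1/|S|$, that the order is $p = 1$, and that the ambient space is the real line, where the order-1 distance equals the integral of the absolute difference between the two empirical quantile functions. The function name `wasserstein1` is hypothetical.

```python
from fractions import Fraction

def wasserstein1(S, T):
    """Order-1 Wasserstein distance between uniform measures on finite real sets."""
    s, t = sorted(S), sorted(T)
    m, n = len(s), len(t)
    # Quantile levels at which either empirical quantile function jumps.
    levels = sorted({Fraction(i, m) for i in range(m + 1)} |
                    {Fraction(j, n) for j in range(n + 1)})
    total = 0.0
    for lo, hi in zip(levels, levels[1:]):
        # Both quantile functions are constant on the interval (lo, hi).
        i = int(lo * m)   # index of the S-quantile on this interval
        j = int(lo * n)   # index of the T-quantile on this interval
        total += float(hi - lo) * abs(s[i] - t[j])
    return total

print(wasserstein1([0], [5]))                # 5.0
print(wasserstein1([0, 100], [0, 100, 200]))
```

Exact fractions are used for the quantile levels so that the interval bookkeeping is not affected by floating point rounding.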
3 Proposed (semi)metrics
In this section, we introduce a new family of semimetrics, and analyse their properties and advantages over existing options. First, we motivate and introduce the MJ$_1$ semimetric, then generalize this to the family of MJ$_p$ semimetrics, which, when properly extended to infinity, includes the Hausdorff metric.
In Dubuisson1994, Jain and Dubuisson assert that their distance MH$_1$ is the best for image matching. To reach this conclusion, they take two steps. First (page 567), they compare three favourable operators, each operating on the minimal distances defined in equation (1). They briefly argue that the operator equivalent to taking the max in MH$_1$ is preferable to the other operators, citing a “larger spread.” Secondly (page 568), they argue that a process of averaging distances is superior to taking $K$th ranked distances, such as the median. We differ with and modify these steps of reasoning. For the first, we replace the max in their MH$_1$ with the $L^1$ norm average of all the minimum distances from $S$ to $T$ and $T$ to $S$:
$d^1_{MJ}(S, T) = \frac{1}{2} \left( \frac{1}{|S|} \sum_{s \in S} d(s, T) + \frac{1}{|T|} \sum_{t \in T} d(t, S) \right)$.
Then we show desirable properties of MJ$_1$ over the existing modified Hausdorff distances in the following two propositions.
Proposition 3.1 (Comparison between MJ$_1$ and MH$_1$).
MJ$_1$ and MH$_1$ are equivalent as semimetrics. However, MJ$_1$ is more precise, in the sense that there exists a class of deformations of $S$ under which $d^1_{MJ}(S, T)$ varies continuously while $d^1_{MH}(S, T)$ does not vary at all.
The proof is given in Appendix B.1.
Proposition 3.2 (Comparison between MJ$_1$ and MH$_2$, MH$_3$).
The following property holds for MJ$_1$ but not MH$_2$ or MH$_3$: if all elements of $S$ are duplicated, $d^1_{MJ}(S, T)$ does not change, while $d^2_{MH}(S, T)$ and $d^3_{MH}(S, T)$ do change.
If we duplicate elements of a set, the set itself does not change, so a measure between sets should not change under such a duplication.
The proof is given in Appendix B.2. MH$_2$ in particular greatly enlarges with the duplication or addition of points.
Remark 3.3.
Unfortunately, if one single point of $S$ is duplicated, $d^1_{MJ}(S, T)$ does change. This is the case with all existing modified Hausdorff distances. Yet even this is not disastrous, because it reflects that a greater concentration of certain elements of $S$ represents a different distribution of the data points in $S$.
Regarding the second step of Dubuisson1994, we agree that an averaging process is much less sensitive to outlier error than the alternative processes. However, we may generalize this process by using other $L^p$ norm averages. And so we present generalized semimetrics below.
Definition 3.1.
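We define the MJ$_p$ distance by
$d^p_{MJ}(S, T) = \left( \frac{\sum_{t \in T} d(t, S)^p}{2|T|} + \frac{\sum_{s \in S} d(s, T)^p}{2|S|} \right)^{1/p}$.
This is chosen so that $d^p_{MJ}(S, T) \to d_H(S, T)$ as $p \to \infty$, for all $S$ and $T$.
Hence, $d_H$ can now be viewed as the $L^\infty$ norm of these distances. Thus, our family of semimetrics includes the Hausdorff distance as a limiting case when $p \to \infty$, placing the existing Hausdorff metric in a new family of semimetrics.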
Remark 3.4.
Usually in the context of $L^p$ norms, $p$ must be in the range $[1, \infty]$ to preserve the triangle inequality. Since these measures do not preserve the triangle inequality, we can take $0 < p < 1$. This means that $d^{0.5}_{MJ}$, for example, is even less sensitive to outliers than the MH$_1$ and MJ$_1$ distances.
As $p$ grows larger, $d^p_{MJ}$ approaches the Hausdorff metric, which satisfies the triangle inequality. As $p$ grows smaller, these distances are less sensitive to outliers. Thus, this continuum of $p$ allows us to compromise between the triangle inequality property of the metric, and the sensitivity to outliers. In section 5.5, we explore the possibility of optimising $p$ under these considerations.
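The role of the order $p$ can be illustrated numerically. The sketch below assumes the MJ distance is the $p$-th power mean of all minimal distances, with total weight one half on each of the two sets; the function name `mj` and the example sets are hypothetical, and the values are only meant to show the trend in $p$.

```python
def point_to_set(x, S):
    """d(x, S) for a finite set S of real numbers."""
    return min(abs(x - s) for s in S)

def hausdorff(S, T):
    return max(max(point_to_set(s, T) for s in S),
               max(point_to_set(t, S) for t in T))

def mj(S, T, p=1.0):
    """Assumed MJ distance of order p: weighted p-th power mean of minimal distances."""
    from_s = sum(point_to_set(s, T) ** p for s in S) / (2 * len(S))
    from_t = sum(point_to_set(t, S) ** p for t in T) / (2 * len(T))
    return (from_s + from_t) ** (1.0 / p)

S = [10, 50, 90]
T = [12, 50, 88, 400]   # 400 acts as an outlier
for p in (0.5, 1, 2, 50):
    print(p, round(mj(S, T, p), 2))
print("Hausdorff", hausdorff(S, T))
```

Under these assumptions the printed values increase with $p$ and approach the Hausdorff distance from below, while duplicating every element of `S` leaves `mj(S, T, 1)` unchanged.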
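If we were guaranteed that all distances $d(s, T)$ and $d(t, S)$ were nonzero, we could also consider $p \leq 0$. For $p = 0$ the above norm is properly interpreted as a limit which equals the geometric mean
$d^0_{MJ}(S, T) = \left( \prod_{t \in T} d(t, S) \right)^{1/(2|T|)} \left( \prod_{s \in S} d(s, T) \right)^{1/(2|S|)}$.
As $p \to -\infty$, $d^p_{MJ}(S, T) \to d_{\min}(S, T)$ from equation (2). Even this quantity is contained in our new family.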
Proof.
Identical to the proofs of these propositions in Appendix B. ∎
Proposition 3.6.
For $p > 0$, the MJ$_p$ measures are semimetrics. However, they fail the triangle inequality up to any constant. That is, there is no constant $k$ such that
(4) $d^p_{MJ}(S, R) \leq k \left( d^p_{MJ}(S, T) + d^p_{MJ}(T, R) \right)$
for all subsets $S, T, R$. This also applies to $d^i_{MH}$, for $i = 1, 2, 3$.
The proof is given in Appendix B.3. We remark that this proof does not exist in the literature even for the existing modified Hausdorff semimetrics.
3.1 Sensitivity to outliers
We examine the sensitivity to outliers of all discussed metrics and semimetrics. Let $S$ and $T$ be fixed finite sets. Fixing all but one element $t_0 \in T$, if $t_0$ acts as an outlier, we examine the effect on all distances $d(S, T)$ as $d(t_0, S)$ grows.
First, the Hausdorff distance increases asymptotically with $d(t_0, S)$ for $d(t_0, S)$ sufficiently large, illustrating its unsuitability for outliers. That is,
$d_H(S, T) \sim d(t_0, S)$.
MH$_2$ contains the term $\sum_{t \in T} d(t, S)$, which also increases asymptotically with $d(t_0, S)$. That is, $d^2_{MH}(S, T) \sim d(t_0, S)$, illustrating its sensitivity to outliers. Due to the averaging within MH$_1$, MH$_3$ and MJ$_p$, all of these semimetrics perform well with outliers, but this gets worse as $p$ increases, and better if it decreases. Specifically,
$d^p_{MJ}(S, T) \sim \frac{d(t_0, S)}{(2|T|)^{1/p}}$.
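The growth rates discussed above can be checked numerically. The sketch below is an illustration under assumptions: it takes the MJ distance of order $p$ to be the weighted $p$-th power mean of minimal distances, moves one hypothetical outlier in `T` further and further from `S`, and reports each distance divided by the outlier's distance to `S`. If the assumed form is right, the ratio should settle near 1 for the Hausdorff distance and near $(2|T|)^{-1/p}$ for the MJ distances.

```python
def point_to_set(x, S):
    return min(abs(x - s) for s in S)

def hausdorff(S, T):
    return max(max(point_to_set(s, T) for s in S),
               max(point_to_set(t, S) for t in T))

def mj(S, T, p):
    """Assumed MJ distance of order p (weighted p-th power mean)."""
    from_s = sum(point_to_set(s, T) ** p for s in S) / (2 * len(S))
    from_t = sum(point_to_set(t, S) ** p for t in T) / (2 * len(T))
    return (from_s + from_t) ** (1.0 / p)

def growth_ratios(outlier):
    """Each distance divided by the outlier's own distance to S."""
    S = [10, 50, 90]
    T = [12, 50, 88, outlier]
    gap = point_to_set(outlier, S)
    return (hausdorff(S, T) / gap, mj(S, T, 1) / gap, mj(S, T, 0.5) / gap)

for outlier in (400, 40_000, 4_000_000):
    print(outlier, [round(r, 4) for r in growth_ratios(outlier)])
```

With four points in `T`, the limiting ratios under this assumption are 1, 1/8 and 1/64, which makes the reduced outlier sensitivity for smaller $p$ concrete.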
Finally, we examine a property of the MJ family, but not the Wasserstein distance, indicating the latter’s unsuitability to measure distance between data sets.
Proposition 3.7.
If the following inequality holds:
No such inequality holds for Wasserstein distance. Even with , it is possible for to coincide with .
The proof is given in Appendix B.4.
Remark 3.8.
As a consequence of proposition 3.7, if and have a large amount of similarity in their elements, will reflect this close similarity between and , while the Wasserstein distance may not. This will prove useful in analysing data sets in sections 5,6. We adopt an example from the proof here. Let and . Observing sets , clear candidates for distances between them are and . Indeed, while .
Using the Wasserstein or Hausdorff distance, the separation between and remains . This is an appropriate distance from a translational or geometric point of view. However, it ignores the remarkable similarity in the data of and . If these were sets of change points, they would be considered remarkably similar. Appropriately, , .
To summarize, we have proven:
Theorem 3.9.
There exists a family of semimetrics MJ$_p$ which include the Hausdorff distance as a limiting member when $p \to \infty$. Like the modified Hausdorff distances, they fail the triangle inequality up to any constant. However, they have a precision advantage over MH$_1$, a duplication-invariance advantage over MH$_2$ and MH$_3$, and are much more insensitive to outliers than MH$_2$ and the Hausdorff metric. They also are more suitable than the Wasserstein metric at reflecting high intersection in the data.
4 Computational methodology
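To generate our distance matrix $D$, we compute the distance between all sets of time series change points within our collection of time series. Suppose we have $n$ time series with sets of change points $S_1, \ldots, S_n$. We define the following distance matrices.
Hausdorff distance matrix $D^{H}$:
(5) $D^{H}_{ij} = d_H(S_i, S_j)$.
MH$_1$ distance matrix $D^{MH_1}$:
(6) $D^{MH_1}_{ij} = d^1_{MH}(S_i, S_j)$.
MH$_2$ distance matrix $D^{MH_2}$:
(7) $D^{MH_2}_{ij} = d^2_{MH}(S_i, S_j)$.
MH$_3$ distance matrix $D^{MH_3}$:
(8) $D^{MH_3}_{ij} = d^3_{MH}(S_i, S_j)$.
Wasserstein distance matrix $D^{W}$:
(9) $D^{W}_{ij} = d^p_W(S_i, S_j)$.
MJ$_p$ distance matrix $D^{MJ_p}$:
(10) $D^{MJ_p}_{ij} = d^p_{MJ}(S_i, S_j)$.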
4.1 Transitivity analysis
Like the modified Hausdorff distances, the novel proposed distance measures MJ$_p$ do not satisfy the triangle inequality in general. This is significant because it is possible that sets $S, T$ and $T, R$ are each close with respect to these measures, but $S, R$ are not close. Then the property of closeness would not be transitive. However, in practice, the distance measures respect transitivity quite well, at least for $p \geq 1$. We examine two questions:

1. how often do the new semimetric distances fail the triangle inequality;

2. how badly do these distances violate the triangle inequality.

To explore these two questions, we empirically generate a three dimensional matrix and test whether the triangle inequality is satisfied for all possible combinations of elements within the matrix. We construct our matrix as follows:
(11) 
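The two questions above can be explored with a short routine over a precomputed distance matrix. This is a sketch under stated assumptions rather than the paper's exact construction: it assumes that a triple $(i, j, k)$ fails when $D_{ik} > D_{ij} + D_{jk}$, that the fail ratio of such a triple is $D_{ik} / (D_{ij} + D_{jk})$, and that ordered triples of distinct indices are counted. The function name and the example matrix are hypothetical.

```python
from itertools import permutations

def transitivity_report(D):
    """Return (fraction of failing triples, average fail ratio or None)."""
    n = len(D)
    ratios = []
    total = 0
    for i, j, k in permutations(range(n), 3):
        total += 1
        bound = D[i][j] + D[j][k]
        if D[i][k] > bound:
            # Assumed fail ratio: how far the direct distance exceeds the detour.
            ratios.append(D[i][k] / bound if bound > 0 else float("inf"))
    fraction = len(ratios) / total if total else 0.0
    average = sum(ratios) / len(ratios) if ratios else None
    return fraction, average

# Hypothetical semimetric matrix: sets 0 and 2 are each close to set 1,
# yet far from one another.
D = [[0, 1, 3],
     [1, 0, 1],
     [3, 1, 0]]
print(transitivity_report(D))
```

A fail ratio only slightly above one indicates a mild violation, which is the sense in which a semimetric may still respect transitivity reasonably well in practice.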
4.2 Eigenvalue analysis
Analysing the eigenvalues and eigenvectors of a system in physical and applied sciences is of great importance. Matrix diagonalization and a matrix's eigenspectrum arise in many applications such as stability analysis and oscillations of vibrating systems. In our context, we analyse the distance matrices $D$, all of which are symmetric real matrices with trace $0$. As such, they can be diagonalized over the real numbers with real eigenvalues. To determine similarity of time series with respect to their change points, we plot the absolute value of eigenvalues for all matrices. Note all eigenvalues are real and sum to zero.
Consider the following real world heuristic: many real time series, such as stock returns, are not necessarily highly correlated on a regular basis. The returns of Microsoft and Ford may have little to do with each other over time. However, it is expected that a significant market event or crash would significantly affect both Microsoft and Ford at essentially the same time, and yield a change point in the stochastic properties of both time series at the same time. Thus, even if the overall properties of time series may be uncorrelated or negatively correlated, change points are likely to cluster. It would be of considerable interest if a third time series, say the returns of a new green energy company, had change points different from the majority. Perhaps this third time series would then be concluded to be less vulnerable to a market crash. In our analysis of time series, especially cryptocurrency in section 6, we expect a large majority of time series to follow similar change points, and from these we will be able to examine the exceptional ones for any opportunities.
Mathematically, if say $k$ out of $n$ time series have very similar change points, then, after ordering these series first, the distance matrix $D$ should have the following structure: rows $1, \ldots, k$ are highly similar to one another and elements $D_{ij}$ with $1 \leq i, j \leq k$ are close to zero. This means small deformations in the matrix entries exist to make the first $k$ rows identical. Hence, this matrix is a small deformation from a matrix of rank at most $n - k + 1$, with at least $k - 1$ eigenvalues equal to zero. That is, if $k$ of $n$ time series have very similar change points, then $k - 1$ of the eigenvalues should be close to zero.
Given a threshold $\epsilon$, we can rank the absolute values of the eigenvalues $|\lambda_1| \leq \ldots \leq |\lambda_n|$. If $|\lambda_1|, \ldots, |\lambda_{k-1}| < \epsilon$ then we can deduce $k$ of the time series are similar with respect to their stochastic breaks. This may be the most pithy way of expressing the number of time series that are similar in terms of their change points within large collections of time series. If we a priori have reason to believe that one large majority cluster will exist, a judicious choice of $\epsilon$ can determine its size immediately. One can approximate its size by inspection from the graphical depictions such as Figures 2, 6 and 10.
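The eigenvalue heuristic can be tried on a toy distance matrix. The sketch below is illustrative only: it computes the eigenvalues of a symmetric matrix with a plain cyclic Jacobi rotation scheme (a standard textbook method, chosen here to avoid external libraries and not necessarily what the authors used), counts the eigenvalues whose absolute value falls below a threshold, and adds one to estimate the majority cluster size. The function names, the threshold and the toy matrix are hypothetical.

```python
import math

def symmetric_eigenvalues(M, sweeps=100, tol=1e-20):
    """Eigenvalues of a real symmetric matrix via cyclic Jacobi rotations."""
    n = len(M)
    a = [[float(v) for v in row] for row in M]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q (A <- A J)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # update rows p and q (A <- J^T A)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), key=abs)

def majority_cluster_size(D, eps):
    """Count eigenvalues with |lambda| < eps and add one."""
    eigs = symmetric_eigenvalues(D)
    return sum(1 for lam in eigs if abs(lam) < eps) + 1, eigs

# Toy matrix: three nearly identical objects and two distinct ones.
positions = [0.0, 0.001, 0.002, 10.0, 25.0]
D = [[abs(a - b) for b in positions] for a in positions]
size, eigs = majority_cluster_size(D, eps=0.1)
print(size, [round(lam, 3) for lam in eigs])
```

The largest absolute eigenvalue in the printed list also gives the overall scale of the matrix, which is the second use of the eigenvalue analysis.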
Moreover, eigenvalue analysis provides us a quick measure of the scale of the distance matrix. Since all distance (and affinity) matrices are symmetric, $D$ can be conjugated by an orthogonal matrix to give a diagonal matrix of its eigenvalues. This is known as the spectral theorem, Axler. As a consequence, the operator norm RudinFA coincides with $\max_i |\lambda_i|$. That is,
$\|D\|_{op} = \sup_{x \neq 0} \frac{\|Dx\|}{\|x\|} = \max_i |\lambda_i|$.
Remark 4.1.
Even when these plots look quite similar, the scale gives us information about the scale of the distance matrices.
4.3 Spectral clustering affinity matrix
Spectral clustering applies a graph theoretic interpretation of our problem, and projects our data into a lower dimensional space, the eigenvector domain, where it may be more easily separated by standard algorithms such as $k$-means. We transform our distance matrix $D$ into an affinity matrix $A$ as follows:
(12) 
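The graph Laplacian matrix is given by:
(13) $L = E - A$,
where $E$ is the diagonal degree matrix with diagonal entries $E_{ii} = \sum_j A_{ij}$. The graph Laplacian matrix possesses four key properties:

1. $x^T L x = \frac{1}{2} \sum_{i,j} A_{ij} (x_i - x_j)^2$, where $x \in \mathbb{R}^n$ is any candidate eigenvector.

2. $L$ is symmetric and positive semidefinite.

3. The smallest eigenvalue of $L$ is $0$ and the respective eigenvector is the constant vector $(1, \ldots, 1)^T$.

4. Its eigenvalues are nonnegative real numbers $0 = \lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_n$.

After computing the graph Laplacian matrix $L$, we determine the top $k$ eigenvectors $u_1, \ldots, u_k$ and construct the matrix $U \in \mathbb{R}^{n \times k}$ from $u_1, \ldots, u_k$. Finally, we treat each row of $U$ as a vertex in the low dimensional projection of our graph and cluster these vertices using the standard $k$-means algorithm. Following Ng et al. Ng2002, $k$ is chosen to be the value that maximizes the eigengap $|\lambda_{k+1} - \lambda_k|$.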
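The construction of the affinity and Laplacian matrices can be sketched in a few lines. The transform from distances to affinities used here, one minus the distance divided by the largest entry of the distance matrix, is an assumption made for illustration, as are the function names and the toy matrix; the Laplacian is the degree matrix minus the affinity matrix. The example checks the quadratic form identity and the zero row sums numerically.

```python
def affinity_matrix(D):
    """Assumed transform: A_ij = 1 - D_ij / max(D), so close series get affinity near 1."""
    largest = max(max(row) for row in D)
    return [[1.0 - d / largest for d in row] for row in D]

def graph_laplacian(A):
    """Degree matrix minus affinity matrix, with degrees given by row sums of A."""
    n = len(A)
    degree = [sum(row) for row in A]
    return [[(degree[i] if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

def quadratic_form(L, x):
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

points = [0, 1, 3, 7]                      # toy objects on a line
D = [[abs(a - b) for b in points] for a in points]
A = affinity_matrix(D)
L = graph_laplacian(A)
x = [1.0, -2.0, 0.5, 3.0]                  # arbitrary test vector
lhs = quadratic_form(L, x)
rhs = 0.5 * sum(A[i][j] * (x[i] - x[j]) ** 2
                for i in range(4) for j in range(4))
print(lhs, rhs)
```

Because every row of the Laplacian sums to zero, the constant vector is an eigenvector with eigenvalue zero, and the nonnegative quadratic form gives positive semidefiniteness whenever the affinities are nonnegative.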
4.4 Dendrogram analysis
A dendrogram displays the hierarchical relationships between objects in a dataset. Hierarchical clustering falls into two categories:

Agglomerative clustering: a bottom-up approach where all data points start as individual clusters; or

Divisive clustering: a top-down approach where all data points start in the same cluster and are recursively split.
The dendrogram and hierarchical clustering results are highly dependent on the distance measure used to determine clusters. We display the respective dendrogram of our eight candidate distance matrices and assess which method displays similarity between time series most appropriately for our change point problem. The colours of the dendrogram indicate the closeness of any two sets of time series change points.
5 Simulation study
We generate three collections of ten time series. The first collection exhibits very few change point outliers, the second exhibits a moderate number of less severe change point outliers and the final collection of time series exhibits multiple extreme change point outliers. The Hausdorff, three modified Hausdorff varieties, Wasserstein, MJ$_{0.5}$, MJ$_1$ and MJ$_2$ distances are compared between the time series.
5.1 Simulation 1: no change point outliers
Table 1: Spectral clustering results

Metric         TS1  TS2  TS3  TS4  TS5  TS6  TS7  TS8  TS9  TS10
Hausdorff      1    1    1    1    1    2    2    2    3    4
MH$_1$         1    1    1    1    1    2    2    2    3    4
MH$_2$         1    1    1    1    1    2    3    1    4    1
MH$_3$         1    1    1    1    1    2    2    2    3    4
Wasserstein    1    2    3    4    2    2    2    2    2    2
MJ$_{0.5}$     1    1    1    1    1    2    2    2    3    4
MJ$_1$         1    1    1    1    1    2    2    2    3    4
MJ$_2$         1    1    1    1    1    2    2    2    3    4
Figure 1 displays the ten time series of candidate change points. When assessing if two time series are similar with respect to their change points, we are interested in both the location and number of change points. There is an average spacing of about 35 units between change points; this simulates realistic outputs from a change point detection algorithm that generally requires a minimum number of data points within locally stationary segments. In this scenario, one should consider the first five time series (1-5 inclusive) as similar, the next three (6-8 inclusive) as similar and the final two (9 and 10) as dissimilar to all other time series. Although there are no change point outliers in this scenario, it is instructive to measure how various distance measures perform without the presence of outliers.
Interpreting similarity among large collections of time series’ change points may be a difficult task. Therefore, we make inference using all three of our proposed methods in section 4 to analyse each candidate distance matrix. Perhaps the most concise and expressive display of general similarity or dissimilarity within any such collection is the plot of the absolute value of the eigenvalues of the distance matrices. We compare all our distance measures in Figure 2. All distance measures appear to indicate that there are five time series that are highly similar, three that are slightly less similar and two far more dissimilar to the rest of the collection. In this instance, all eigenvalue plots look very similar: without the existence of outliers, all these distance measures perform similarly, so this is expected. Note the difference in scale of the diagrams reflects the value of $\max_i |\lambda_i| = \|D\|_{op}$, hence the total scale of these matrices.
Table 1 shows that of our eight distance measures, six, namely Hausdorff, MH$_1$, MH$_3$, MJ$_{0.5}$, MJ$_1$ and MJ$_2$, cluster the time series correctly. Both the Wasserstein and MH$_2$ distances fail to determine appropriate clusters within the spectral clustering. The dendrograms in Figure 3 should be analysed carefully. All distance measures indicate that there is a cluster of five change point sets that are similar, another cluster of three and two unrelated change point sets. However, spectral clustering highlights that the Wasserstein and MH$_2$ distance measures incorrectly identify which time series should be considered similar.
We also analyse the transitivity, described in section 4.1, over all sets of change points for the eight (semi)metrics in our analysis. The MJ$_{0.5}$ fails most frequently, with 51% of potential triples failing, and an average fail ratio of 1.28. Both the MJ$_1$ and MJ$_2$ distances fail the triangle inequality in this simulation, with 4% of triples failing for the MJ$_1$ distance and 3% failing for the MJ$_2$ distance. Of those triples that fail the triangle inequality, the average fail ratio is 1.32 and 1.14 for the MJ$_1$ and MJ$_2$ distances respectively. As expected, as $p$ increases, transitivity seems to improve. All of the modified Hausdorff distances fail the triangle inequality too. For one of them, 8.5% of triples fail, with an average fail ratio of 1.06. A second has a lower percentage of failed triples, with only 4% failing; however, those that do fail perform significantly worse, with an average fail ratio of 1.49. The third also has 4% of triples fail, with a less severe average fail ratio of 1.41. So in this scenario the MJ$_{0.5}$ has the most failed triples by a significant margin, whereas a modified Hausdorff distance has the highest average fail ratio (1.49) and in that sense violates the triangle inequality most severely.
5.2 Simulation 2: moderate change point outliers
Table 2: Spectral clustering results

Metric         TS1  TS2  TS3  TS4  TS5  TS6  TS7  TS8  TS9  TS10
Hausdorff      1    2    2    1    1    2    2    2    3    4
MH$_1$         1    1    1    1    1    2    2    2    3    4
MH$_2$         1    1    1    1    1    2    2    2    3    4
MH$_3$         1    1    1    1    1    2    2    2    3    4
Wasserstein    1    2    3    2    4    2    2    2    2    2
MJ$_{0.5}$     1    1    1    1    1    2    2    2    3    4
MJ$_1$         1    1    1    1    1    2    2    2    3    4
MJ$_2$         1    1    1    1    1    2    2    2    3    4
Figure 5 shows ten simulated time series change points that exhibit outliers with moderate frequency and severity. This is a more realistic scenario than Figure 1, as outliers occur regularly when applying change point detection algorithms to multiple time series. Again, time series 1-5, 6-8, 9 and 10 should be identified as separate clusters.
Figure 6 displays the increasing absolute value of the eigenvalues. In this simulation, we see that the Hausdorff distance in Figure 5(a) has detected eight time series that are highly similar, and two that are dissimilar relative to the others. Other distance measures have identified the general similarity more appropriately. Two of the modified Hausdorff distances and the MJ$_{0.5}$, MJ$_1$ and MJ$_2$ semimetrics in particular appear to produce sensible outputs.
The dendrograms displayed in figure 7 illustrate the Hausdorff distance’s sensitivity to outliers. The remaining six distance measures correctly identify the general structure in the time series collection. That is, there are two separate clusters of highly similar time series and two unrelated time series.
The spectral clustering results in Table 2 indicate that six of the eight distance measures correctly identified similar groupings of time series, namely MH$_1$, MH$_2$, MH$_3$, MJ$_{0.5}$, MJ$_1$ and MJ$_2$. Once again, the Wasserstein distance, although producing eigenvalue and dendrogram outputs consistent with the successful distance measures, proposed inappropriate collections of time series within candidate clusters.
In the presence of moderate outliers, the MJ$_{0.5}$ violates the triangle inequality worst, both in terms of the percentage of failed triples (58%) and the average fail ratio (1.54). Both the MJ$_1$ and MJ$_2$ distances fail the triangle inequality (Figure 8) too, with 10% of potential triples failing when using the MJ$_1$ distance and 5.6% with the MJ$_2$ distance. The average fail ratio is 1.13 for the MJ$_1$ distance and 1.08 for the MJ$_2$ distance. Among the three modified Hausdorff distances, the worst average fail ratio is 1.19, for a variant with 5.6% of potential triples violating the triangle inequality; the other two have 8.4% of triples failing with an average fail ratio of 1.14, and 10.4% failing with an average fail ratio of 1.11. In this scenario, the MJ$_{0.5}$ is the only semimetric whose average fail ratio and percentage of failed triples may make it unusable. This may however depend on the context of usage.
5.3 Simulation 3: extreme change point outliers
Table 3: Spectral clustering results

Metric         TS1  TS2  TS3  TS4  TS5  TS6  TS7  TS8  TS9  TS10
Hausdorff      1    1    2    2    2    2    2    2    3    4
MH$_1$         1    1    1    1    1    2    2    2    3    4
MH$_2$         1    1    1    1    1    2    2    2    3    4
MH$_3$         1    1    1    1    1    2    2    2    4    3
Wasserstein    1    1    2    3    1    4    1    1    1    1
MJ$_{0.5}$     1    1    1    1    1    2    2    2    3    4
MJ$_1$         1    1    1    1    1    2    2    2    3    4
MJ$_2$         1    1    1    1    1    2    2    2    3    4
Figure 9 displays the third collection of time series’ change points, where we analyse the effects of extreme outliers on candidate distance measures. This is a highly contrived scenario in our application of change point detection. First, change point algorithms typically require a minimum number of points within locally stationary segments, while in this scenario, all of our time series have change points in succession and several time series have an extreme outlier. The purpose of this scenario is to highlight measures which do not perform well in the case of extreme outliers, to identify measures that fail the triangle inequality and to investigate how badly they fail the triangle inequality.
The presence of outliers certainly impacts most of the distance measures when analysing the magnitude of the eigenvalues (seen in Figure 10). The Hausdorff distance in particular (Figure 9(a)) fails to identify the similarity in the first five time series (1-5). Interestingly, one of the modified Hausdorff distances does not display the appropriate degree of similarity in the first five time series, while the other two produce plots that indicate similarity among the change point sets more appropriately. The Wasserstein distance in Figure 9(e) and MJ$_1$ in Figure 9(g) both produce outputs consistent with the phenomenology of the scenario. Both the MJ$_{0.5}$ and MJ$_2$ also produce appropriate outputs.
The dendrograms displayed in Figure 11 indicate that the Hausdorff distance performs worst. The heat map fails to identify appropriate similarity among the time series. The MH, MH, MH, MJ and MJ distances produce heat maps that are most representative of the true similarity in the set. Interestingly, the MJ (Figure 10(h)) distance does not perform as well as the MJ or MJ (Figure 10(g)) in this scenario, perhaps due to its higher order. That is, lower orders provide stronger geometric mean averaging. In particular, the MJ distance has difficulty distinguishing between clusters 2, 3 and 4, which should contain time series 6-8, 9 and 10 respectively.
Spectral clustering highlights that the MH, MH, MH, MJ, MJ and MJ correctly identify clusters of similar time series. Again, both the Hausdorff and Wasserstein metrics do not perform well. The Hausdorff metric in particular has severe sensitivity to outliers.
The transitivity analysis under this (contrived) scenario of extreme outliers demonstrates that the MJ may be unusable, with 47.7% of triples failing the triangle inequality and an average fail ratio of 8.63. The MJ and MJ distances also have substantial proportions of triples that fail the triangle inequality, but significantly fewer than MJ. In the case of the MJ distance, 14.2% of triples fail the triangle inequality, with an average fail ratio of 1.86. In the case of the MJ distance, 11% of triples fail the triangle inequality, with an average fail ratio of 1.45. For the modified Hausdorff distances, 14.4% of MH triples fail, with an average fail ratio of 2.18; 15% of MH triples fail, with an average fail ratio of 1.85; and 15% of MH triples fail, with an average fail ratio of 1.86.
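The transitivity statistics quoted above can be computed along the following lines. This is a minimal sketch, not the authors' code: it assumes that a "failed triple" is an unordered triple whose longest pairwise distance exceeds the sum of the other two, and that the "fail ratio" of such a triple is the longest distance divided by that sum. The function name `triangle_failures` is hypothetical.

```python
from itertools import combinations


def triangle_failures(D, tol=1e-12):
    """Return (fraction of failed triples, average fail ratio) for a
    symmetric distance matrix D given as a list of lists.

    Assumed definitions (not taken verbatim from the paper):
      - a triple fails if its longest side exceeds the sum of the other two;
      - its fail ratio is longest / (sum of the other two).
    """
    n = len(D)
    ratios = []
    total = 0
    for i, j, k in combinations(range(n), 3):
        total += 1
        sides = sorted([D[i][j], D[j][k], D[i][k]])
        longest, rest = sides[2], sides[0] + sides[1]
        if longest > rest + tol:
            # Guard against a zero denominator for degenerate inputs.
            ratios.append(longest / rest if rest > 0 else float("inf"))
    fraction = len(ratios) / total if total else 0.0
    average = sum(ratios) / len(ratios) if ratios else None
    return fraction, average


# Toy matrix: only the triple (0, 1, 2) violates the inequality (3 > 1 + 1).
D_example = [
    [0, 1, 3, 2],
    [1, 0, 1, 2],
    [3, 1, 0, 2],
    [2, 2, 2, 0],
]
fraction_failed, average_ratio = triangle_failures(D_example)
```

Because at most one of the three inequalities in a triple can fail (the one bounding the longest side), checking the longest side alone is sufficient.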
5.4 Summary of simulations
The MJ, MJ and modified Hausdorff semimetrics all display an improvement over the traditional Hausdorff and Wasserstein metrics with regard to similarity and anomaly detection among collections of time series. The MJ family does appear to provide an improvement over the modified Hausdorff semimetrics in terms of inference. There are various cases where the MH, MH or MH exhibit errors in clustering experiments. Additionally, the proportion of failed triples and the severity of failures are not prohibitively worse in any of the scenarios explored. The order used in the MJ semimetrics has a significant impact on transitivity properties. The MJ is clearly the worst violator of the triangle inequality. With this in mind, recalling section 3 and Dubuisson1994, we again differ from Jain and Dubuisson’s conclusion that MH is the best for image matching. Both the MJ and MJ have advantages over MH. MJ may be appropriate in situations where we do not care about the triangle inequality, or where we have verified through this analysis that it does not fail transitivity too severely. In such scenarios, MJ is considerably more robust with respect to outliers than all other options.
5.5 Role of order in geometric averaging
Given the clear tradeoff between transitivity and robustness to outliers across the distance measures, one avenue to explore is optimising the order in the norm. That is, the order should be large enough to satisfy the triangle inequality, yet small enough to allow for geometric averaging. We find that when the order exceeds 2 or 3, the geometric averaging property is lost and the measure becomes sensitive to outliers. Alternatively, one may allow an acceptable percentage of measurements within the time series collection to violate the triangle inequality. That is, we could insist that
where is some acceptable percentage of failed triples within the matrix. We set Figure 12(a) demonstrates that once reaches , less than 5% of triples in fail the triangle inequality. However, the dendrogram in figure 13 demonstrates that the MJ distance no longer provides the geometric averaging required to produce robust measurements. In fact, inference gained with the MJ distance would be entirely erroneous. When considered in conjunction with the results from other simulated experiments, our findings suggest that needs to be low for powerful geometric averaging and robustness to outliers.
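The effect of the order on outlier sensitivity can be illustrated numerically. The sketch below rests on an assumption: it takes the MJ distance of order p between finite sets S and T to be the p-th root of the sum of the two directed p-th power averages, each weighted by one half. This form is consistent with the statements in this paper that the distance averages over minimal distances and approaches the Hausdorff distance for large order, but the definition in Section 3 is authoritative. The names `mj_distance`, `hausdorff` and `nearest` are hypothetical.

```python
def nearest(x, S):
    """Distance from the point x to the closest element of the set S."""
    return min(abs(x - s) for s in S)


def hausdorff(S, T):
    """Classical Hausdorff distance between two finite sets of reals."""
    return max(max(nearest(t, S) for t in T),
               max(nearest(s, T) for s in S))


def mj_distance(S, T, p):
    """ASSUMED form of the order-p MJ distance (see the lead-in):
    ( sum_t d(t,S)^p / (2|T|) + sum_s d(s,T)^p / (2|S|) )^(1/p)."""
    a = sum(nearest(t, S) ** p for t in T) / (2 * len(T))
    b = sum(nearest(s, T) ** p for s in S) / (2 * len(S))
    return (a + b) ** (1.0 / p)


# Two change point sets that agree except for one extreme outlier in T.
S = [10, 20, 30]
T = [10, 20, 30, 100]

# Low orders damp the outlier; high orders approach the Hausdorff distance.
by_order = {p: mj_distance(S, T, p) for p in (0.5, 1, 2, 50)}
d_hausdorff = hausdorff(S, T)
```

Under this assumed form the distance is a weighted power mean of the minimal distances, so it is non-decreasing in the order and bounded above by the Hausdorff distance. That is one way to read the tradeoff discussed here: a single outlier dominates the measurement more and more as the order grows.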
6 Real applications
6.1 Cryptocurrency
The cryptocurrency market is in its relative infancy in comparison to most exchange traded financial products. Cryptocurrencies are infamous for their volatile price behaviour, and high degree of correlation within the market, due to crowd behaviour often referred to as “herding”.
We analyse the similarity in the change points of the thirty largest cryptocurrencies by market capitalisation. We apply the Mann-Whitney test to the log returns of each cryptocurrency and use the MJ distance as our measure of choice to allow powerful geometric averaging and robustness to outliers. Note that the log returns provide approximately normally distributed data with mean zero. Given the extreme volatility and associated kurtosis displayed by cryptocurrencies, this transformation is essential.
Our method provides the following insights:

The cryptocurrency market is characterised by a high degree of similarity between time series. This is seen in the plot of the eigenvalues (Figure 13(a)), which confirms that approximately 24 cryptocurrencies behave very similarly. The dendrogram in Figure 13(b) confirms this, with a large proportion of the market exhibiting a small distance between change points.

Our method highlights the presence of anomalous cryptocurrencies, and indicates which cryptocurrencies are behaving dissimilarly to the rest of the market. In particular, Figure 13(b) shows that XMR and DCR behave very differently to the rest of the cryptocurrency market. They do however have a high degree of similarity between themselves.

Spectral clustering on the distance matrix confirms the presence of anomalous cryptocurrencies warranting their own cluster.

Finally, the dendrogram in Figure 13(b) highlights the presence of subclusters of cryptocurrencies. These are subsets that behave similarly to one another, and less similarly to the rest of the market. This is often the case in financial markets, where companies in similar sectors or geographic regions may become correlated due to an exogenous variable or variables.
To summarize, we have uncovered a high degree of correlation within the cryptocurrency market, and unearthed anomalies, including clusters thereof. Our analysis gives insight into anomalous and similar behaviours with respect to structural breaks, a feature of time series of key interest to analysts and market participants.
6.2 19th Century UK Measles counts
Given the recent public interest in the novel coronavirus (2019-nCoV), we analyse historic counts of measles, a similarly infectious virus. In this context, understanding change points could have immense public health significance, signifying perhaps a growth in infectivity, or a temporary scare where fears were allayed and the disease’s spread halted. Applying our method to measles counts in 19th century UK cities, one can determine the similarity between structural changes in time series and perform anomaly detection simultaneously. First, we apply the Kolmogorov-Smirnov change point detection algorithm to all 7 time series, yielding 7 sets of change points.
Analysis on UK measles data yields the following insights:

Figure 15(b) suggests that there are 5 time series that are similar and two relative outliers.

Figure 15(a) highlights that there are 4 time series that are similar with respect to their structural breaks, 2 moderately similar and 1 anomalous city. The time series deemed to be most similar are Sheffield, London, Manchester and Newcastle. This insight is consistent with visual inspection of the time series in Figure 16, where these time series exhibit multiple periodicities that are detected via the change point algorithm. Although our algorithm does not explicitly measure the periodic nature of each time series, our method manages to capture the most pertinent feature in this collection of time series.

Birmingham is detected as the anomalous time series, and this is supported by the results from spectral clustering the distance matrix’s graph Laplacian.
7 Conclusion
Prior work indicates that metrics adapting distance measures have mostly been used in computer vision applications. Although prior work has disproved the triangle inequality for these measurements, we are unaware of any work examining theoretically or empirically how badly the triangle inequality fails, and under what conditions. Our experiments on simulated and real data indicate that when there are more outliers within a collection of time series, and when there is a smaller average distance between successive change points, there is generally a higher percentage of triples that fail the triangle inequality. This is reflected in proposition 3.6 and its proof, showing that the bunching of points can cause the triangle inequality to fail up to any arbitrary constant. While the Hausdorff metric satisfies the triangle inequality, it is extremely sensitive to outliers, as is confirmed in prior work among varying applications. This presents an interesting tradeoff. Semimetrics, such as the ones investigated in this paper, may lose universal transitivity; however, different averaging methods (other than the max operation) produce more appropriate distance measurements between time series.
After applying a distance measure between sets of change points, we demonstrate the insights one can generate. First, we show a pithy means of eigenvalue analysis where we plot the absolute value of the ordered eigenvalues. This analysis illustrates the number of time series that are similar within any candidate scenario. We demonstrate the dendrogram’s ability to simultaneously uncover similarity between time series and perform anomaly detection. Finally, we use spectral clustering on a transformation of the distance matrix (the affinity matrix) and illustrate how spectral clustering may determine groups of similar time series and detect clusters of anomalous time series.
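The matrix computations behind these three analyses can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the eigenvalue profile is taken from the symmetric distance matrix itself, the affinity matrix is assumed to be one minus the distance matrix scaled by its largest entry, and the graph Laplacian is the unnormalised one. All function names are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def eigenvalue_profile(D):
    """Absolute eigenvalues of a symmetric distance matrix, in ascending order."""
    D = np.asarray(D, dtype=float)
    return np.sort(np.abs(np.linalg.eigvalsh(D)))


def affinity_matrix(D):
    """ASSUMED transformation: A = 1 - D / max(D).
    Requires at least one non-zero distance."""
    D = np.asarray(D, dtype=float)
    return 1.0 - D / D.max()


def graph_laplacian(A):
    """Unnormalised graph Laplacian L = Deg - A of an affinity matrix."""
    A = np.asarray(A, dtype=float)
    return np.diag(A.sum(axis=1)) - A


# Toy collection: four series with identical change points and one outlier
# whose change points lie at distance 5 from all the others.
D_toy = np.zeros((5, 5))
D_toy[:4, 4] = 5.0
D_toy[4, :4] = 5.0

profile = eigenvalue_profile(D_toy)
laplacian = graph_laplacian(affinity_matrix(D_toy))
```

In this toy case the profile has a block of eigenvalues at zero produced by the four identical series, with the outlier contributing the two large values. Spectral clustering would then typically run k-means on the eigenvectors of the Laplacian with the smallest eigenvalues; that step is omitted here.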
Our results on simulated data show that semimetrics perform as well as or better than traditional metrics such as the Hausdorff and Wasserstein distances in all settings (no outliers, moderate outliers and extreme outliers). Moreover, our proposed family of semimetrics performs better than the newer modified Hausdorff semimetrics. Our two applications, cryptocurrency and measles, demonstrate that our method may detect similar time series while also identifying anomalous phenomena.
In summary, we have introduced a new computationally useful continuum of semimetrics, which we term MJ distances, and applied them to measure distances between sets of change points. As the order tends to infinity, MJ distances approach the Hausdorff distance. Thus, we have understood the Hausdorff distance in a new context within a family of MJ distances. Eigenvalue analysis, hierarchical clustering and spectral clustering can be used to analyse matrices of distances between sets of change points. Our analysis indicates that there is a tradeoff between transitivity and sufficient geometric mean averaging (robust results in the presence of outliers). Traditional metrics such as the Hausdorff and Wasserstein distances are severely impacted by outliers, while semimetrics with strong averaging properties such as the MJ fail the triangle inequality frequently and, in some cases, severely. Our experiments indicate that both the MJ and MJ distances are better compromises than the existing modified Hausdorff framework. In future work, we will aim to investigate and generalise other distance metrics outside the Hausdorff framework.
Appendix A Change point detection algorithm
Many statistical modelling problems require the identification of change points in sequential data. The general setup for this problem is the following: a sequence of observations is drawn from random variables and undergoes an unknown number of changes in distribution at points . One assumes observations are independent and identically distributed between change points; that is, between each pair of consecutive change points, a random sampling of a single distribution is occurring. Following Ross RossCPM, we notate this as follows:
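One common way to write this setup is sketched below. The symbols are assumptions of this sketch: $X_i$ denotes the observations, $F_j$ the distribution on the $j$-th segment and $\tau_j$ the change points.

```latex
X_i \sim
\begin{cases}
F_0 & \text{if } i \le \tau_1, \\
F_1 & \text{if } \tau_1 < i \le \tau_2, \\
F_2 & \text{if } \tau_2 < i \le \tau_3, \\
\ \vdots &
\end{cases}
```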
While this requirement of independence may appear restrictive, dependence can generally be accounted for by modelling the underlying dynamics or drift process, then applying a change point algorithm to the model residuals or one-step-ahead prediction errors, as described by Gustafsson Gustafsson2000. The change point models applied in this paper follow Ross RossCPM.
A.1 Batch change detection (Phase I)
This phase of change point detection is retrospective. We are given a fixed-length sequence of observations from random variables . For simplicity, assume at most one change point exists. If a change point exists at time , observations have a distribution of prior to the change point, and a distribution of following the change point, where . That is, one must test between the following two hypotheses for each :
and end with the choice of the most suitable .
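For a candidate change location $k$ in a sequence of length $n$, the two hypotheses can be sketched as below. The symbols $X_i$, $F_0$, $F_1$, $k$ and $n$ are assumptions of this sketch, chosen to match the usual presentation of the change point model framework.

```latex
\begin{aligned}
H_0 &: X_i \sim F_0, \quad i = 1, \dots, n, \\
H_1 &: X_i \sim
\begin{cases}
F_0 & i = 1, \dots, k, \\
F_1 & i = k + 1, \dots, n,
\end{cases}
\qquad F_0 \neq F_1 .
\end{aligned}
```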
One proceeds with a two-sample hypothesis test, where the choice of test is dependent on the assumptions about the underlying distributions. To avoid distributional assumptions, nonparametric tests can be used. Then one appropriately chooses a two-sample test statistic and a threshold . If then the null hypothesis is rejected and we provisionally assume that a change point has occurred after . These test statistics are normalized to have mean 0 and variance 1 and evaluated at all values , and the largest value is assumed to be coincident with the existence of our sole change point. That is, the test statistic is then
where were our unnormalized statistics.
The null hypothesis of no change is then rejected if for some appropriately chosen threshold . In this circumstance, we conclude that a (unique) change point has occurred and its location is the value of which maximizes . That is,
This threshold is chosen to bound the Type 1 error rate as is standard in statistical hypothesis testing. First, one specifies an acceptable level for the proportion of false positives, that is, the probability of falsely declaring that a change has occurred if in fact no change has occurred. Then, should be chosen as the upper quantile of the distribution of under the null hypothesis. For the details of computation of this distribution, see RossCPM. Computation can often be made easier by taking appropriate choice and storage of the .
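The batch procedure can be sketched in code as follows. This is an illustration under stated assumptions rather than the implementation used in this paper: it uses the Mann-Whitney U statistic as the two-sample statistic, standardises it with the usual large-sample mean and variance (without a tie correction), and takes the threshold as a user-supplied number, whereas in practice it would come from the null distribution of the maximised statistic. The function names are hypothetical.

```python
import math


def midranks(values):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # mean of ranks i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = average_rank
        i = j + 1
    return ranks


def batch_change_point(x, threshold, min_seg=2):
    """Return (estimated change location or None, maximised statistic).

    A location k means the change is taken to occur after the k-th observation.
    """
    n = len(x)
    ranks = midranks(x)
    best_stat, best_k = 0.0, None
    rank_sum = 0.0
    for k in range(1, n):
        rank_sum += ranks[k - 1]
        if k < min_seg or n - k < min_seg:
            continue  # require a minimum number of points on each side
        u = rank_sum - k * (k + 1) / 2.0           # Mann-Whitney U of the first k points
        mean = k * (n - k) / 2.0                   # null mean of U
        sd = math.sqrt(k * (n - k) * (n + 1) / 12.0)  # null standard deviation of U
        stat = abs(u - mean) / sd
        if stat > best_stat:
            best_stat, best_k = stat, k
    if best_k is not None and best_stat > threshold:
        return best_k, best_stat
    return None, best_stat


# Deterministic toy data: 25 values in [0, 0.9] followed by 15 values in [3, 3.9].
series = [(i * 7 % 10) / 10 for i in range(25)] + [3 + (i * 7 % 10) / 10 for i in range(15)]
location, statistic = batch_change_point(series, threshold=3.0)  # threshold is hypothetical
```

Computing the ranks once and accumulating their sum keeps the scan over all candidate locations linear after the initial sort.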
A.2 Sequential change detection (Phase II)
In this case, the sequence does not have a fixed length. New observations are received over time, and multiple change points may be present. Assuming no change point exists so far, this approach treats as a fixed length sequence and computes as in phase I. A change is then flagged if for some appropriately chosen threshold. If no change is detected, the next observation is brought into the sequence. If a change is detected, the process restarts from the following observation in the sequence. The procedure therefore consists of a repeated sequence of hypothesis tests.
In this sequential setting, is chosen so that the probability of incurring a Type 1 error is constant over time, so that under the null hypothesis of no change, the following holds:
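In the notation commonly used for this framework, the condition can be written as below. The symbols are assumptions of this sketch: $D_t$ is the maximised statistic after $t$ observations, $h_t$ the corresponding threshold and $\alpha$ the chosen false positive rate.

```latex
P(D_1 > h_1) = \alpha, \qquad
P\left(D_t > h_t \mid D_{t-1} \le h_{t-1}, \dots, D_1 \le h_1\right) = \alpha
\quad \text{for } t > 1 .
```

With a constant conditional false alarm probability, the time to the first false positive is geometrically distributed, so its mean is $1/\alpha$.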
In this case, assuming that no change occurs, the average number of observations received before a false positive detection occurs is equal to . This quantity is referred to as the average run length, or ARL0. Once again, there are computational difficulties with this conditional distribution and the appropriate values of , as detailed in Ross RossCPM.
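The sequential procedure can be sketched as a loop over a growing window. This is an illustration under several assumptions, not the implementation used in this paper: it uses a standardised Mann-Whitney statistic (no tie correction), a single constant threshold in place of the time-varying thresholds described above, and it restarts the window immediately after the estimated change location. All names and the threshold value are hypothetical.

```python
import math


def max_standardised_u(window, min_seg):
    """Maximise the standardised Mann-Whitney statistic over split points.

    Returns (statistic, k), where k is the number of observations before the split.
    """
    n = len(window)
    order = sorted(range(n), key=lambda i: window[i])
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign midranks to tied values
        j = i
        while j + 1 < n and window[order[j + 1]] == window[order[i]]:
            j += 1
        for m in range(i, j + 1):
            ranks[order[m]] = (i + j) / 2.0 + 1.0
        i = j + 1
    best_stat, best_k, rank_sum = 0.0, None, 0.0
    for k in range(1, n):
        rank_sum += ranks[k - 1]
        if k < min_seg or n - k < min_seg:
            continue
        deviation = abs(rank_sum - k * (n + 1) / 2.0)  # |U - E[U]|
        sd = math.sqrt(k * (n - k) * (n + 1) / 12.0)
        if deviation / sd > best_stat:
            best_stat, best_k = deviation / sd, k
    return best_stat, best_k


def sequential_change_points(x, threshold, min_seg=3):
    """Process x one observation at a time and return the detected change locations."""
    change_points = []
    start = 0
    t = start + 2 * min_seg
    while t <= len(x):
        stat, k = max_standardised_u(x[start:t], min_seg)
        if k is not None and stat > threshold:
            change_points.append(start + k)
            start = start + k          # assumed restart rule: resume after the estimate
            t = start + 2 * min_seg
        else:
            t += 1                     # no change flagged: take in the next observation
    return change_points


# Deterministic toy stream with level shifts after observations 25 and 50.
pattern = [(i * 7 % 10) / 10 for i in range(25)]
stream = pattern + [3 + v for v in pattern] + [-3 + v for v in pattern]
detected = sequential_change_points(stream, threshold=3.5)
```

Applying such a routine to each series in a collection yields one set of change points per series, which is the input to the distance computations in the main text.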
Appendix B Proof of propositions
B.1 Proposition 3.1
Proof.
Since is the average of two quantities while is the maximum, we have so these are equivalent as semimetrics.
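The elementary inequality behind this step can be written for two generic nonnegative quantities $a$ and $b$. These are placeholders introduced here, standing for the two quantities that one distance averages and the other maximises.

```latex
\tfrac{1}{2} \max(a, b) \;\le\; \frac{a + b}{2} \;\le\; \max(a, b),
\qquad a, b \ge 0 .
```

The lower bound holds because the sum $a + b$ is at least its larger term, and the upper bound because each term is at most the maximum. Two distances related in this way bound each other up to a constant factor.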
Now let be two sets with the following properties: assume
Moreover, assume there exists an element that is the closest element of to every element of . This property can hold easily if are each contained within convex sets respectively and is the closest element of to . Now slightly deform the elements of , moving them a sufficiently small distance to produce another set also containing . Then
However, if the elements of move, each distance will vary continuously with their movement, so will vary continuously with . This makes it a more precise measure of the distance between and . ∎
B.2 Proposition 3.2
Proof.
Replace with a duplicated multiset .
Then , , and .
Thus, , while
∎
B.3 Proposition 3.6
Proof.
Let , and . All four distances MJ, MH, MH, MH are symmetric in by definition, so axiom 2. holds.
All four distances are sums of powers of nonnegative real numbers, so are clearly nonnegative. Now assume either or is zero for any . Since all quantities in the summation of or are nonnegative, and all minimal distances are included, this forces for all , and for all . That is, every element in lies in and vice versa. This proves and establishes axiom 1 for all four distances. So all four are semimetrics.
Now turn to the triangle inequality. For MH, the triangle inequality fails easily when are much larger than . For example, let be a set of points very close to each other, let be another such set, and . Assume and respectively are all within of each other. Let and let . Then
So