Robust topological information commonly comes in the form of a set of persistence diagrams, finite measures that are by nature difficult to plug into generic machine learning frameworks. We introduce a learnt, unsupervised measure vectorisation method and use it to reflect underlying changes in topological behaviour in machine learning contexts. Relying on optimal measure quantisation results, the method is tailored to efficiently discriminate important regions of the plane where meaningful differences arise. We showcase the strength and robustness of our approach on a number of applications, from emulous and modern graph collections, where the method reaches state-of-the-art performance, to a geometric synthetic dynamical orbits problem. The proposed methodology comes with only high-level tuning parameters, such as the total measure encoding budget, and we provide fully open-access software.
Topological Data Analysis (TDA) is a field dedicated to capturing and describing relevant geometric or topological information from data. Combining TDA with standard machine learning tools has proved particularly advantageous for complex data, that is, objects that are not, or only partly, Euclidean, such as graphs or time series. Applications are abundant and range from social network analysis, bioinformatics and chemoinformatics to medical imaging and computer vision, to name a few.
Through persistent homology, a multi-scale analysis of the topological properties of the data, robust and stable information can be extracted. The resulting features are commonly computed in the form of a persistence diagram, whose structure (an unordered set of points in the plane representing birth and death times of the features) does not easily fit the general machine learning input format. TDA is therefore generally combined with machine learning by way of an embedding method for persistence diagrams.
Contributions. Our work is set in that trend. First, using recent measure quantisation results, we introduce a learnt, unsupervised vectorisation method for measures in Euclidean spaces of any dimension (Section 2.1). Then we specialise this method to persistence diagrams (Section 2.2), allowing for easy integration of topological features into challenging machine learning problems with theoretical guarantees (Theorem 1). We illustrate our approach on a set of experiments that lead to state-of-the-art results on difficult problems (Section 3). Lastly, we provide an open source implementation and notebook.
Our quantisation of the space of diagrams is statistically optimal, and the resulting algorithm, a variant of Lloyd’s algorithm, is simple and efficient. It can also be used in a minibatch fashion, which makes it practical for large scale and high dimensional problems and competitive with more sophisticated methods involving kernels, deep learning, or computations of Wasserstein distances. To the best of our knowledge, our results provide the first vectorisation method for persistence diagrams that is proven to be able to separate clusters. The method requires little to no tuning, and no knowledge of TDA is needed to use the framework.
Related work. Finding representations of persistence diagrams that are well suited to standard machine learning pipelines is a problem that has attracted a lot of interest in recent years. A first family of approaches consists in finding convenient vector representations of persistence diagrams. Examples include interpreting diagrams as images in [Adams2017], extracting topological signatures with respect to fixed points whose optimal positions are learnt in a supervised way in [hofer2017deep], or taking a square-root transform of their approximated pdf in [Anirudh].
A second family of approaches consists in designing specific kernels on the space of persistence diagrams, such as the multi-scale kernel of [Reininghaus2015], the weighted Gaussian kernel of [Kusano2016] or the sliced Wasserstein kernel of [Carriere2017]. These techniques achieve state-of-the-art results, but as a drawback they require an additional step to obtain an explicit representation, and they are known to scale poorly.
Another, more recent line of work has managed to directly combine the awkward structure of persistence diagrams with neural network architectures [Zaheer2017], [perslay]. Despite their successful performance, these neural networks are heavy to deploy and hard to interpret. They are sometimes paired with a representation method, as in [hofer2017deep], [hofer2019graph].
Persistent homology in TDA
Persistent homology provides a rigorous mathematical framework and efficient algorithms to encode relevant multi-scale topological features of complex data such as point clouds, time series or 3D images. More precisely, persistent homology encodes the evolution of the topology of families of nested topological spaces $(F_\alpha)_{\alpha \in A}$, called filtrations, built on top of the data and indexed by a set of real numbers $A$ that can be seen as scale parameters. For example, for a point cloud in a Euclidean space, $F_\alpha$ can be the union of the balls of radius $\alpha$ centered on the data points (see Figure 1). Given a filtration $(F_\alpha)_{\alpha \in A}$, its topology (homology) changes as $\alpha$ increases: new connected components can appear, existing connected components can merge, loops and cavities can appear or be filled, etc. Persistent homology tracks these changes, identifies features and associates to each of them an interval, or lifetime, from $\alpha_{\mathrm{birth}}$ to $\alpha_{\mathrm{death}}$. For instance, a connected component is a feature that is born at the smallest $\alpha$ such that the component is present in $F_\alpha$, and dies when it merges with an older connected component. The set of intervals representing the lifetimes of the identified features is called the barcode of the filtration. As an interval can also be represented as a point in the plane with coordinates $(\alpha_{\mathrm{birth}}, \alpha_{\mathrm{death}})$, the persistence barcode is equivalently represented as a union of such points, called the persistence diagram; see [edelsbrunner2010computational, boissonnat2018geometric] for a more detailed introduction.
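As a small illustration of births and deaths of connected components, the sketch below computes the 0-dimensional persistence pairs of the union-of-balls filtration of a point cloud. It relies on the fact that two balls of radius $\alpha$ touch when their centres are at distance $2\alpha$, so components merge along the edges of a minimum spanning tree. The function name `h0_persistence` is hypothetical, and this is only a toy computation under these assumptions, not the software used in this paper.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """0-dimensional persistence pairs of the union-of-balls filtration.

    Every point is a connected component born at radius 0. Two components
    merge when balls of radius alpha around their closest points touch,
    i.e. at alpha = distance / 2. The last surviving component never dies.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process candidate merges by increasing distance (Kruskal's algorithm).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    diagram = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            diagram.append((0.0, length / 2))  # one component dies here
    if n:
        diagram.append((0.0, math.inf))  # the component that never dies
    return diagram


cloud = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
print(h0_persistence(cloud))  # [(0.0, 0.5), (0.0, 1.0), (0.0, inf)]
```

Each returned pair is one point of the diagram; higher-dimensional features (loops, cavities) require a full persistent homology library.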
The main advantages of persistence diagrams are the following:
(i) they are proven to provide robust qualitative and quantitative topological information about the data [chazal2012structure];
(ii) since each point of the diagram represents a specific topological feature with its lifespan, they are easily interpretable as features;
(iii) from a practical perspective, persistence diagrams can be efficiently computed from a wide family of filtrations [gudhi].
However, as persistence diagrams come as unordered sets of points with non-constant cardinality, they cannot be immediately processed as standard vector features in machine learning algorithms. It can be beneficial to interpret persistence diagrams as measures, see for instance [chazal2012structure], [chazal2018], and in this work we adopt this paradigm.
Consider the set $\mathcal{M}$ of finite measures on the $d$-dimensional ball $B(0, R)$ of the Euclidean space $\mathbb{R}^d$ with total mass smaller than $M$, for some given $M, R > 0$.
We assume that the set of input persistence diagrams comes as an i.i.d. sample from a distribution of uniformly bounded diagrams. That is, given $M, R > 0$, let $\mathcal{D}$ be the space of persistence diagrams with at most $M$ points contained in the Euclidean disc $B(0, R)$. The space $\mathcal{D}$ is considered as a subspace of the set of finite measures on $B(0, R)$ with total mass smaller than $M$: any $D \in \mathcal{D}$ is identified with the measure $\sum_{x \in D} \delta_x$, where $\delta_x$ is the Dirac measure centered at point $x$.
In this section we introduce Atol, a simple unsupervised data-driven method for measure vectorisation. Atol automatically converts a distribution of persistence diagrams into a distribution of feature vectors that are well suited for use as topological features in standard machine learning pipelines.
To summarise, given a positive integer $b$, Atol proceeds in two steps. First, it computes a discrete measure in $\mathbb{R}^d$ supported on $b$ points that approximates the average measure of the distribution from which the input observations have been sampled. Second, it computes a set of well-chosen contrast functions centered on each point of the support of this measure, which are then used to convert each observation into a vector of size $b$.
2.1 Measure vectorisation through quantisation
We now introduce the Atol-featurisation algorithm in its generality, that is, a featurisation method for elements of $\mathcal{M}$. The first step in our procedure is to use quantisation in the space $\mathbb{R}^d$. Starting from an i.i.d. sample of measures $X_1, \dots, X_n$ drawn from a probability distribution on $\mathcal{M}$ and given an integer budget $b$, we leverage recent algorithms and results from [levrard20] and produce a compact representation of the mean measure $E(X)$. That is, we produce a distribution $P_c$ supported on a fixed-length codebook $c = (c_1, \dots, c_b)$ in the ambient space that aims to minimize, over such distributions, the distortion $W_2^2(P_c, E(X))$, the squared 2-Wasserstein distance to the mean measure. In practice, one considers the empirical mean measure $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$ and the $b$-means problem for this measure. The respective adaptations of Lloyd’s [Lloyd82] and MacQueen’s [MacQueen67] algorithms to the format of measures are introduced in [levrard20] and can readily be employed.
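The batch quantisation step can be sketched as plain Lloyd iterations on the pooled points of all observed measures, since pooling with unit masses is the empirical mean measure up to a constant factor that does not change the optimal centres. The code below is a minimal stdlib-only sketch under that assumption; `lloyd_mean_measure`, its parameters and its initialisation on randomly chosen observed points are illustrative choices, not the exact procedure of [levrard20].

```python
import math
import random


def lloyd_mean_measure(measures, budget, n_iter=50, seed=0):
    """Lloyd iterations for the budget-means problem of the empirical mean measure.

    `measures` is a list of point lists, each point carrying unit mass.
    Requires budget <= total number of points.
    """
    pooled = [tuple(p) for measure in measures for p in measure]
    rng = random.Random(seed)
    centers = rng.sample(pooled, budget)  # initialise on observed points
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centre.
        cells = [[] for _ in centers]
        for p in pooled:
            nearest = min(range(budget), key=lambda k: math.dist(p, centers[k]))
            cells[nearest].append(p)
        # Update step: move each centre to the barycentre of its cell
        # (an empty cell keeps its previous centre).
        updated = [
            tuple(sum(coords) / len(cell) for coords in zip(*cell)) if cell else c
            for cell, c in zip(cells, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


sample = [
    [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)],
    [(1.0, 0.0), (10.0, 11.0), (11.0, 10.0)],
]
print(sorted(lloyd_mean_measure(sample, budget=2)))
```

On this toy sample the two centres settle on the barycentres of the two groups of points, one near the origin and one near $(10, 10)$.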
From this quantisation our aim is to derive spatial information on measures in order to discriminate between them. Much like one would compactly describe a point cloud with respect to its barycenter in a PCA, we describe measures through a small number of differences to our approximate mean measure. To this end, our second step is to tailor individual contrast functions, each based on the estimated codebook, that individually describe local regions. In other words, we set out to find regions of the space where measures seem to aggregate on average, and build a dedicated descriptor for those regions. We define and use the following two contrast families $\Psi_p = (\Psi_p^1, \dots, \Psi_p^b)$, for $p \in \{1, 2\}$:
$$\Psi_p^i : x \mapsto \exp\left(-\left(\frac{\|x - c_i\|_2}{\sigma_i}\right)^p\right), \quad i = 1, \dots, b. \tag{1}$$
These specific contrast functions are chosen to decrease away from the approximate mean centroid $c_i$ in a Laplacian ($p = 1$) or Gaussian ($p = 2$) fashion. We choose the scale $\sigma_i$ to roughly correspond to the minimum distance to the closest Voronoi cell in the corresponding codebook $c$. The $p$-exponential decrease will allow us to properly separate a measure mixture in Theorem 1, so by default we make our arguments with that family in mind, and we refer to the Gaussian ($p = 2$) and Laplacian ($p = 1$) versions of the algorithm accordingly. To our knowledge nothing prevents other well-designed contrast families from being substituted in their place, but this is beyond the scope of this paper. Given a family of contrast functions and an approximate mean measure codebook, each element of $\mathcal{M}$ can now be compactly described through its integrated contribution to each contrast function: for $X \in \mathcal{M}$ and contrast function $\Psi_p^i$, let
$$v_i(X) := \int \Psi_p^i(x) \, \mathrm{d}X(x).$$
Our algorithm simply concatenates those contributions into a vector $v(X) = (v_1(X), \dots, v_b(X))$.
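For a discrete measure such as a persistence diagram, integrating a contrast function reduces to summing it over the points of the measure. The sketch below assumes contrast functions of the form $\exp(-(\|x - c_i\|_2 / \sigma_i)^p)$ with $p = 1$ (Laplacian) or $p = 2$ (Gaussian), and takes each scale to be half the distance from a centre to its nearest neighbour in the codebook. Both the exact functional form and this scale rule are assumptions made for illustration, and the names `scales` and `vectorise` are hypothetical.

```python
import math


def scales(centers):
    """Assumed heuristic: half the distance from each centre to its nearest other centre.

    Requires at least two distinct centres.
    """
    return [
        min(math.dist(c, other) for j, other in enumerate(centers) if j != i) / 2
        for i, c in enumerate(centers)
    ]


def vectorise(measure, centers, sigmas, p=1):
    """One coordinate per centre: sum over points x of exp(-(|x - c_i| / sigma_i) ** p)."""
    return [
        sum(math.exp(-((math.dist(x, c) / s) ** p)) for x in measure)
        for c, s in zip(centers, sigmas)
    ]


codebook = [(0.0, 0.0), (4.0, 0.0)]
sigmas = scales(codebook)  # [2.0, 2.0]
diagram = [(0.0, 0.0), (2.0, 0.0)]
print(vectorise(diagram, codebook, sigmas, p=1))  # Laplacian version
print(vectorise(diagram, codebook, sigmas, p=2))  # Gaussian version
```

The output vector has one entry per codebook point, so its length is the budget regardless of how many points the input measure contains.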
Calibration step 1 is optimal for deriving a space quantisation in the following sense: let the excess distortion be the difference between the distortion of the resulting distribution based on codebook $c$ and the optimal distortion of a distribution supported on $b$ points. Then for either version of the quantisation algorithm (batch or minibatch), the excess distortion can be controlled (respectively with high probability or on average) at a minimax rate under margin conditions, see Theorems 4 and 5 from [levrard20].
2.2 Topological learning
We now specialise our featurisation method to the context of topological learning, where we are set in dimension $d = 2$ with $\mathcal{D} \subset \mathcal{M}$. Applying the Atol-featurisation algorithm to a collection of diagrams from $\mathcal{D}$ is straightforward and allows us to embed the complex, unstructured space $\mathcal{D}$ in Euclidean terms. Set in the context of a standard learning problem, we introduce the Atol algorithm (Automatic Topologically-Oriented Learning). Consider observations $x_1, \dots, x_n$ in some space $\mathcal{X}$, each corresponding to a known, partially available or hidden label $y_i$. Assume that one has a way to extract topological features from $\mathcal{X}$, i.e. to derive a collection of diagrams associated with those elements, and let $\phi : \mathcal{X} \to \mathcal{D}$ be the corresponding map. Then applying the Atol-featurisation algorithm to the resulting collection of diagrams provides a simplified topological description of the elements of this problem.
Suppose now that persistence diagrams originate from $L$ distinct sources: assume that observed diagrams are sampled with noise from a mixture model of $L$ distinct measures, by which we mean that any two measures in this set differ by at least one point. Call $Z$ the latent variable associated with the mixture, which records the source each diagram is drawn from. The following result ensures that the vectorisation $v$ has separative power, i.e. that it clearly separates the different sources:
Theorem 1 (Separation with Atol).
For a given noise level assuming satisfies some (explicit) margin condition and for and large enough there exists a non-empty segment for in Equation (1) such that for all , with high probability:
This result follows from Corollary 19 in [levrard20]. The assumptions and margin conditions are classical but rather technical; they are stated in full in [levrard20] (see Definition 3 for the margin conditions).
Notice that, to our knowledge, it has never before been established that a persistence diagram vectorisation method allows for separation, i.e. that diagrams from different clusters will be well discriminated. Finding a space quantisation that discriminates a collection of diagrams from $\mathcal{D}$ is not sufficient: the vectorisation based upon this quantisation could still miss a difference of interest, depending on the chosen contrast family. The above theorem shows that Atol overcomes this issue.
Note that in order to adjust the scale values in Equation (1), we use a common heuristic instead of constant values, and it is our intuition that the chosen adaptive values of Equation (3) help the vectorisation perform better than what the theory can predict.
We point out that, as the embedding map is computed automatically without knowledge of a learning task, its derivation is fully unsupervised. The representation is learnt since it is data-dependent, but it is also agnostic to the task and only depends on getting a glimpse of an average persistence diagram. The minibatch quantisation step of [levrard20] is single-pass, so the vectorisation algorithm has computation time linear in the sample size $n$; it is therefore able to handle high-dimensional problems as long as the corresponding diagrams are provided.
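To make the single-pass idea concrete, the sketch below shows a generic minibatch update: each batch of measures is assigned to the current centres, then every centre moves towards the barycentre of the points it received, with a step that shrinks as the centre accumulates mass. This is a standard online k-means scheme written under our own assumptions; the function `minibatch_quantise` is hypothetical and does not reproduce the exact minibatch algorithm of [levrard20].

```python
import math


def minibatch_quantise(batches, initial_centers):
    """Single pass over `batches`; each batch is a list of measures (point lists)."""
    centers = [list(c) for c in initial_centers]
    counts = [0] * len(centers)  # mass accumulated by each centre so far
    for batch in batches:
        points = [p for measure in batch for p in measure]
        # Assign the whole batch with the centres frozen.
        sums = [[0.0] * len(c) for c in centers]
        hits = [0] * len(centers)
        for p in points:
            k = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            hits[k] += 1
            sums[k] = [s + x for s, x in zip(sums[k], p)]
        # Move each centre towards its batch barycentre with a decreasing step,
        # which keeps the centre equal to the running mean of its points.
        for k, c in enumerate(centers):
            if hits[k]:
                counts[k] += hits[k]
                step = hits[k] / counts[k]
                centers[k] = [ci + step * (s / hits[k] - ci) for ci, s in zip(c, sums[k])]
    return [tuple(c) for c in centers]


stream = [
    [[(1.0, 1.0)], [(9.0, 9.0)]],      # first batch: two measures
    [[(3.0, 3.0), (11.0, 11.0)]],      # second batch: one measure
]
print(minibatch_quantise(stream, [(0.0, 0.0), (10.0, 10.0)]))
# [(2.0, 2.0), (10.0, 10.0)]
```

Every measure is read exactly once, which is what makes the cost linear in the number of observations.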
This featurisation is conceptually close to two other recent works. [hofer2017deep] computes a persistence diagram vectorisation through a deep learning layer that adjusts Gaussian contrast functions used to produce topological signatures, much like our Calibration step 2. In essence, our approach substitutes quantisation for deep learning, needs no supervision and comes with mathematical guarantees. Next, the bag-of-words method of [zelinski2019] uses an ad hoc form of quantisation for the space of diagrams, then count functions as contrast functions to produce histograms as topological signatures. There are in fact significant differences, which ultimately translate into effectiveness: Section 3.1 shows that the Atol-featurisation produces state-of-the-art mean accuracy on two difficult multi-class classification problems (66.9% on REDDIT5K and 51.6% on REDDIT12K) that are also tackled by those papers, where [hofer2017deep] report mean accuracies of 54.5% and 44.5% respectively, and [zelinski2019] report accuracies of 49.9% and 38.6% respectively.
3 Competitive TDA-Learning
In this section we demonstrate experimentally the advantages of our approach. We show the Atol framework to be competitive and state-of-the-art, but also versatile and easy to use with high automaticity.
3.1 Graph Classification
|REDDIT (5K, 5 classes)||56.1 ± .5||47.8||59.5 ± .6||-||55.6 ± .3||66.9 ± .3||67.3|
|REDDIT (12K, 11 classes)||48.7 ± .2||-||48.5 ± .5||-||47.7 ± .2||51.6 ± .1||51.6 ± .2|
|COLLAB (5K, 3 classes)||81.0 ± .3||80.0||-||83.6 ± .1||76.4 ± .4||87.8 ± .2||88.1 ± .1|
|IMDB-B (1K, 2 classes)||71.9 ± 1.||73.6||75.1 ± 1.1||76.9 ± 3.6||71.2 ± .7||74.3 ± .8||74.5 ± .5|
|IMDB-M (1.5K, 3 classes)||47.7 ± .3||52.4||48.4 ± .5||52.8 ± 4.6||48.8 ± .6||47.8 ± .8||48.3 ± .7|
Learning problems involving graph data are receiving strong interest at the moment. Consider graph classification: given a finite family of graphs $(G_1, \dots, G_n)$ with available labels $(y_1, \dots, y_n)$, one learns a map $G \mapsto y$.
Recently [perslay] have introduced a powerful way of extracting topological information from graph structures. They make use of heat kernel signatures (HKS) for graphs [hu2014stable], a spectral family of signatures (with diffusion parameter $t$) whose topological structure can be encoded in the extended persistence framework, yielding four types of topological features with exclusively finite persistence. On both of these points we refer to Sections 4.2 and 2 of [perslay]. Therefore, for each graph and HKS diffusion time, the resulting topological descriptor consists of four persistence diagrams with all finite coordinates. For the entire set of problems to come we use the same two HKS diffusion times, fueling the extended graph persistence framework and resulting in 8 persistence diagrams per considered graph. For the budget in the Atol algorithm we choose $b = 10$ for all experiments, which means that the Atol-featurisation algorithm relies on approximating the mean measure on ten points per diagram type and filtration. We make no use of (and automatically discard) the graph attributes on edges or vertices that some datasets possess, and no other sort of features are collected, so that our results are solely based on the graph structure of the problems. To sum up, the Atol algorithm here simply consists in reducing the original problem on graphs to a problem on vectors, each graph being described by the concatenated vectorisations of its 8 diagrams. We stress that the embedding map of the Atol-featurisation algorithm is computed each time using all diagrams from the training set, without supervision. To measure the worth of this embedding in this learning context, we evaluate the featurisation for classification purposes using the standard scikit-learn [sklearn] random-forest classification tool, with a fixed number of trees and all other parameters set to their defaults. On each problem we perform a 10-fold cross-validation procedure and average the resulting accuracies; we report accuracies and standard deviations over ten such experiments.
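The evaluation protocol described above (ten repetitions of a 10-fold cross-validation, reporting the mean and the standard deviation of the ten averaged accuracies) can be sketched as follows. To keep the example free of external dependencies, a nearest-centroid rule stands in for the random forest; `repeated_kfold_accuracy`, `nearest_centroid` and the `fit_predict` callback are hypothetical names. The sketch takes precomputed feature vectors, whereas a faithful run would also recalibrate the vectorisation on the training part of each fold inside `fit_predict`.

```python
import math
import random
import statistics


def nearest_centroid(train_x, train_y, test_x):
    """Stand-in classifier: predict the label of the closest class centroid."""
    groups = {}
    for x, y in zip(train_x, train_y):
        groups.setdefault(y, []).append(x)
    centroids = {y: [sum(col) / len(xs) for col in zip(*xs)] for y, xs in groups.items()}
    return [min(centroids, key=lambda y: math.dist(x, centroids[y])) for x in test_x]


def repeated_kfold_accuracy(features, labels, fit_predict, n_folds=10, n_repeats=10, seed=0):
    """Mean and standard deviation, over repeats, of the k-fold averaged accuracy.

    Assumes len(labels) >= n_folds and n_repeats >= 2.
    """
    rng = random.Random(seed)
    run_means = []
    for _ in range(n_repeats):
        order = list(range(len(labels)))
        rng.shuffle(order)
        folds = [order[k::n_folds] for k in range(n_folds)]
        fold_accuracies = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            predictions = fit_predict(
                [features[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [features[i] for i in test_idx],
            )
            correct = sum(pred == labels[i] for pred, i in zip(predictions, test_idx))
            fold_accuracies.append(correct / len(test_idx))
        run_means.append(statistics.mean(fold_accuracies))
    return statistics.mean(run_means), statistics.stdev(run_means)


# Toy, well-separated data: two classes around 0 and around 10.
toy_x = [[10.0 * (i % 2) + 0.01 * i] for i in range(40)]
toy_y = [i % 2 for i in range(40)]
print(repeated_kfold_accuracy(toy_x, toy_y, nearest_centroid))
```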
|MUTAG (188)||90.3 ± 1.1||92.1||88.3 ± 2.6||90.0 ± 8.5||89.8 ± .9||86.7 ± .8||87.5 ± .6|
|COX2 (467)||81.4 ± .6||-||-||-||80.9 ± 1.||78.8 ± .5||79.4 ± 1.3|
|DHFR (756)||81.5 ± .9||-||-||-||80.3 ± .8||81.9 ± .8||83.1 ± .8|
|PROTEINS (1113)||78.0 ± .3||73.4||78.5 ± .4||75.6 ± 4.2||74.8 ± .3||72.7 ± .4||72.4 ± .5|
|NCI1 (4110)||84.5 ± .2||79.8||87.5 ± .5||84.2 ± 1.5||73.5 ± .3||78.8 ± .3||79.9 ± .2|
|NCI109 (4127)||-||78.8||87.4 ± .3||-||69.5 ± .3||77.6 ± .2||78.5 ± .3|
|FRNKNSTN (4337)||76.4 ± .3||-||-||-||70.7 ± .4||72.8 ± .2||73.1 ± .3|
We use two sets of graph classification problems for benchmarking, one of social network origin and one of chemoinformatics and bioinformatics origin. They include small and large sets of graphs (MUTAG has 188 graphs, REDDIT12K has 12000), small and large graphs (IMDB-M has 13 nodes on average, REDDIT5K has more than 500), dense and sparse graphs (FRANKENSTEIN has around 12 edges per node, COLLAB has more than 2000), binary and multi-class problems (REDDIT12K has 11 classes), all available on the public page [benchmark]. Computations are run on a single laptop (i5-7440HQ 2.80 GHz CPU), in batch version for datasets smaller than a thousand observations and in minibatch version otherwise. Average computing times of the Atol-featurisation algorithm (the average time to calibrate the vectorisation map on the training set, then compute the vectorisation on the entire dataset) are: less than 0.1 seconds for datasets with fewer than a thousand observations, less than 10 seconds for datasets with fewer than five thousand observations, 25 seconds for REDDIT5K, 50 seconds for REDDIT12K and 110 seconds for the densest problem, COLLAB. The results presented here are openly accessible and reproducible with the public repository github.com/martinroyer/atol, which requires the open source library Gudhi [gudhi].
We compare performances to the top scoring methods for these problems, to the best of our knowledge. Those methods are mostly graph kernel methods tailored to graph problems: two graph kernel methods based on random walks (RetGK1, RetGK11 from [zhang2018retgk]), one graph embedding method based on spectral distances (FGSD from [verma2017hunt]), two topological graph kernel methods (WKPI-kM and WKPI-kC from [qi2019]), one graph kernel combined with a graph neural network (GNTK from [du19]) and one topological vectorisation method learnt by a neural network (PersLay from [perslay]). Competitor accuracies are quoted from their respective publications, and we detail how they should be interpreted: for RetGK, WKPI and PersLay the evaluation is done over ten 10-fold cross-validations, just as ours is, so the results compare directly; for FGSD the average accuracy over a single 10-fold is reported; and for GNTK the average accuracy and deviation are reported over a single 10-fold as well. When there are two or more methods under one label, we always report the best outcome.
Our results in Table 1 match or substantially improve the state of the art on the large social network datasets, which are rather difficult multi-class problems. The results on the chemoinformatics and bioinformatics datasets in Table 2 are state-of-the-art. These results are especially positive seeing how the Atol-featurisation algorithm is generic and has been designed neither for graph experiments nor for persistence diagrams specifically, and seeing how the classification task has been entrusted to an external and generic learning tool. Contrary to competitors, the method does not require constructing a kernel or a neural network. Overall, the simplicity and absence of tuning hint at robustness and good generalisation power.
3.2 Discrete dynamical systems seen as measures
[Adams2017] use a synthetic, discrete dynamical system (used to model flows in DNA microarrays) with the following property: the resulting chaotic trajectories exhibit distinct topological characteristics depending on a parameter $r > 0$. The dynamical system is:
$$x_{n+1} = x_n + r \, y_n (1 - y_n) \mod 1, \qquad y_{n+1} = y_n + r \, x_{n+1} (1 - x_{n+1}) \mod 1.$$
With random initialisation and five different parameters $r$, a thousand iterations per trajectory and a thousand orbits per parameter, a dataset of five thousand orbits is constituted and commonly used for evaluating topological methods. Figure 2 shows a few orbits generated with different parameters. For orbits generated with one particular parameter value, it happens that the initialisation spawns close to an attractor point that gives the orbit the special shape seen in the leftmost orbit. The problem of classifying the orbits of this dataset according to their underlying parameter is rather challenging. Some competitive topological methods have tackled this problem in the following way: after a learning phase with a 70/30 split, the accuracy is reported with its standard deviation over a hundred such experiments. The following results have been reported: 72.38 ± 2.4 [Reininghaus2015], 76.63 ± 0.7 [Kusano2016], 83.6 ± 0.9 [Carriere2017], 85.9 ± 0.8 [Le2018], and the state-of-the-art 87.7 ± 1.0 with persistence diagrams in [perslay].
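Generating such a dataset is straightforward. The sketch below iterates the map $x_{n+1} = x_n + r y_n (1 - y_n) \bmod 1$, $y_{n+1} = y_n + r x_{n+1} (1 - x_{n+1}) \bmod 1$ from a random starting point in the unit square. The function names are hypothetical, and the five parameter values listed in the code are the ones commonly quoted for this benchmark; they are an assumption here, since they are not spelled out in the text above.

```python
import random


def generate_orbit(r, n_points=1000, start=None, rng=None):
    """Iterate the discrete dynamical system and return the visited points."""
    rng = rng or random.Random()
    x, y = start if start is not None else (rng.random(), rng.random())
    orbit = []
    for _ in range(n_points):
        x = (x + r * y * (1.0 - y)) % 1.0
        y = (y + r * x * (1.0 - x)) % 1.0  # uses the freshly updated x
        orbit.append((x, y))
    return orbit


def generate_dataset(parameters, orbits_per_parameter, n_points=1000, seed=0):
    """Return (orbit, label) pairs, the label being the index of the parameter."""
    rng = random.Random(seed)
    return [
        (generate_orbit(r, n_points=n_points, rng=rng), label)
        for label, r in enumerate(parameters)
        for _ in range(orbits_per_parameter)
    ]


# Assumed parameter values (commonly quoted for this benchmark).
assumed_parameters = [2.5, 3.5, 4.0, 4.1, 4.3]
small_dataset = generate_dataset(assumed_parameters, orbits_per_parameter=2, n_points=100)
print(len(small_dataset), len(small_dataset[0][0]))  # 10 100
```

Each orbit is a point cloud in the unit square and can be fed to a measure vectorisation directly, without computing persistence diagrams.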
Since those discrete orbits can be seen as measures in $[0, 1]^2$, we apply our learning framework directly on the observed point clouds, i.e. we use the Atol-featurisation algorithm on the synthetic orbits, and for learning we use the scikit-learn [sklearn] random-forest classification tool (with a fixed number of trees and all other parameters set to their defaults) on the resulting vectors. Note that in this context, our framework resembles that of image classification where, instead of a fixed grid for measurement, we have learnt centers from which to look at the data. After learning, the importance of each center can be represented (see an example with 80 centers in Figure 3), and naturally the centers that gather important geometrical information gain importance in this process.
Using the same 70/30 split procedure and repeating a hundred times with the Laplacian contrast family and the same budget of centers (that is, in the exact same configuration as the graph experiment), we obtain 88.3 ± .8 mean accuracy and deviation. Therefore our results are also competitive for this high dimensional problem. What is more, increasing the budget on this experiment yields significant gains: using the Gaussian contrast family and a larger number of centers for cloud description allows us to reach 95% accuracy or more, so it seems that this problem can be precisely described by a purely spatial approach, and our framework can be labelled as such in this context.
We also use this synthetic dataset to present additional experiments, displayed in Table 3, designed to understand the influence of the parameters of the Atol-featurisation algorithm. The considered parameters are: (i) a high or low budget for describing the measure space, (ii) the contrast functions (Laplacian or Gaussian) used for vectorisation of the quantised space, (iii) the proportion of training observations used for deriving the quantisation, with 10% indicating that a random selection of a tenth of the measures from the training set was used to calibrate the Atol-featurisation algorithm.
Naturally, augmenting the budget for vectorising the measure space is expected to yield a better description of said space, and this intuition is confirmed by Table 3. But we stress that a weakness of the Atol-featurisation algorithm is that once centers are fixed in the quantisation, regions of the space that are too far from these centers will necessarily be left out, along with the information they carry. Therefore this intuition can sometimes be wrong, if a lower number of centers happens to lead to a more pertinent quantisation. Next, the comparison of contrast functions clearly shows the Gaussian contrast functions to perform better than the Laplacian ones. Understanding the ability of such contrast functions to describe a particular observation space is challenging and left for future work. Lastly, ORBIT5K seems to be a dataset where the percentage of observations used in the calibration part of the algorithm does not weigh much on the final result for a sufficiently large budget (it does have a significant influence when the budget is lower). This tells us that the calibration can be stable for a given level of information.
|Budget effect||Contrast functions||Calibration effect|
|16.5 s||19.8 s||20.1 s||25.8 s||65.8 s||26 s||25.8 s||25.8 s||25 s||26 s|
3.3 Topological score for time series, an industrial application
Finally we present an industrial application to time series, in a case where the learning problem is hard and no obvious solution is to be found. The dataset consists of the following experiments: using a commercially available simulator of a Japanese city road circuit course, about a hundred subjects are monitored and the intervals between successive heartbeats are recorded (RRI data sampled at 4 Hz) during an 80-minute drive that includes two periods of high-speed driving, at the beginning and at the end of the experiment, and a low-speed driving period in the middle designed to induce sleepiness. For each experiment, an expert annotation (labelled NEDO score) produced from visual observation of the driver is made available, indicating sleepiness on a 1 to 5 class scale. We show four such experiments in Figure 4 (the RR-intervals have been normalised).
This problem of retrieving the sleepiness level from RRI levels is hard and ill-posed: there are strong individual differences in perceived reaction to a given situation, a single experiment per subject to learn behaviour from, and apparent noise or absence of signal in annotations, see e.g. subject 3 in Figure 4. Nevertheless we propose to use the Atol framework to produce features meant to reflect the sleepiness level of subjects based on RRI variations. The intent is that, even though these features will reflect the latent sleepiness level only poorly, they could be enough to catch jumps in the perceived attention level. The framework can readily be applied to time series in any given dimension and used to produce topological features. For this application we follow a classical path: (i) use a sliding window decomposition of the RRI time series, (ii) use a time-delay embedding to transform each window into a point cloud, (iii) apply persistent homology analysis (we use the DTM-filtration [anai2018]) to produce persistence diagrams and (iv) vectorise the persistence diagrams using the Atol-featurisation algorithm.
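Steps (i) and (ii) of this path can be sketched in a few lines. The window width, step, embedding dimension and delay used below are placeholder values chosen for illustration only, as the settings of the actual experiment are not given here, and the function names are hypothetical.

```python
def sliding_windows(series, width, step):
    """Step (i): consecutive windows of `width` samples, shifted by `step`."""
    return [series[s:s + width] for s in range(0, len(series) - width + 1, step)]


def time_delay_embedding(window, dimension, delay):
    """Step (ii): map a window to the point cloud of its delayed coordinates.

    Point t is (w[t], w[t + delay], ..., w[t + (dimension - 1) * delay]).
    """
    n_points = max(len(window) - (dimension - 1) * delay, 0)
    return [
        tuple(window[t + k * delay] for k in range(dimension))
        for t in range(n_points)
    ]


signal = [0, 1, 2, 3, 4, 5]  # stand-in for a normalised RRI series
windows = sliding_windows(signal, width=4, step=2)
print(windows)                                   # [[0, 1, 2, 3], [2, 3, 4, 5]]
print(time_delay_embedding(windows[0], 2, 1))    # [(0, 1), (1, 2), (2, 3)]
```

Each resulting point cloud is then passed to a persistent homology computation, and the diagrams obtained are vectorised as in step (iv).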
We concatenate those features with the mean and standard deviation statistics of the sliding window. As for learning, we compute a learner based on other individuals’ features regressed to their NEDO scores, and use it to generate a score based on Atol features (see middle and bottom rows in Figure 5). Although this score imperfectly reflects the underlying NEDO score for a given subject, it can still have some uses. We set out to detect two jumps on this topologically-augmented score using a Gaussian kernel. For comparison purposes, we also compute a regressor based on the standard features without additional topological features, and also detect two jumps on this standard score.
Figure 5 shows two example results of our analysis. Each panel (top and bottom) consists of three time series: the (hidden) NEDO score (top row), the Atol score computed from a regressor based on topological features (middle row), and a standard score computed from a regressor based solely on standard features (bottom row). The changes of colour from blue to red and back to blue indicate the changes in the experimental design of the driving simulation, i.e. the red portion indicates low-speed driving whereas the blue portions indicate high-speed driving periods. The black dotted lines indicate jumps detected from the Atol representation, whereas the red dotted lines indicate jumps detected from the standard representation. In the top panel, the two series of jumps are concomitant, and almost an exact match to the underlying changes in the experimental design. In the bottom panel, the Atol score improves over the standard score and better reflects the changes in latent NEDO score for this subject, to the point that the detected jumps are an exact match for the changes in experimental conditions. Overall, the Atol score has fewer spikes and more regularity than the standard score, which is expected as the topological features are extracted after a time-delay embedding procedure.
This paper introduces a vectorisation for measures in Euclidean spaces based on optimal quantisation procedures, then shows how this method can be employed in machine learning contexts to help process topological features. Atol has a rather simple design, is multifaceted and ties theoretical guarantees to practical efficiency.
Moreover, Atol only depends on a few simple parameters, namely the size of the codebook and the choice of contrast functions. The study of their effect and the design of a method for choosing these parameters automatically deserve further analysis and are left to future work.