Learning metrics for persistence-based summaries and applications for graph classification
Recently a new feature representation and data analysis methodology based on a topological tool called persistent homology (and its corresponding persistence diagram summary) has started to gain momentum.
A series of methods have been developed to map a persistence diagram to a vector representation so as to facilitate the downstream use of machine learning tools, and in these approaches the importance (weight) of different persistence features is often pre-set. In practice, however, the choice of the weight-function should depend on the nature of the specific type of data one considers, and it is thus highly desirable to learn a best weight-function (and thus a metric for persistence diagrams) from labelled data. We study this problem and develop a new weighted kernel, called WKPI, for persistence summaries, as well as an optimization framework to learn a good metric for persistence summaries. The kernel is positive semi-definite and its induced distance is stable, and the optimization problem admits a simple matrix formulation. We further apply the learned kernel to the challenging task of graph classification, and show that our WKPI-based classification framework obtains similar or (sometimes significantly) better results than the best results from a range of previous graph classification frameworks on a collection of benchmark datasets.
1 Introduction
In recent years a new data analysis methodology based on a topological tool called persistent homology has started to gain momentum in the learning community. Persistent homology is one of the most important developments in the field of topological data analysis in the past two decades, and there have been fundamental developments both on the theoretical front (e.g., [20, 9, 11, 7, 12]) and on algorithms / efficient implementations (e.g., [38, 4, 13, 17, 26, 3]). At a high level, given a domain $X$ with a function $f: X \to \mathbb{R}$ defined on it, persistent homology provides a way to summarize "features" of $X$ across multiple scales simultaneously in a single summary called the persistence diagram (see the lower left picture in Figure 1). A persistence diagram consists of a multiset of points in the plane, where each point $p = (b, d)$ intuitively corresponds to the birth-time ($b$) and death-time ($d$) of some (topological) feature of $X$ w.r.t. $f$. Hence it provides a concise representation of $X$ that captures its multi-scale features. Furthermore, the persistent homology framework can be applied to complex data (e.g., 3D shapes, or graphs), and different summaries can be constructed by putting different descriptor functions on the input data.
For these reasons, a new persistence-based feature vectorization and data analysis framework (see Figure 1) has recently attracted much attention. Specifically, given a collection of objects, say a set of graphs modeling chemical compounds, one can first convert each object to a persistence-based representation. The input data can now be viewed as a set of points in a certain persistence-based feature space. Equipping this space with an appropriate distance or kernel, one can then perform downstream data analysis tasks, such as clustering or classification.
The original distances for persistence diagram summaries unfortunately do not lend themselves easily to machine learning tasks. Hence in the last few years, starting from the persistence landscape, a series of methods have been developed to map a persistence diagram to a vector representation to facilitate machine learning tools. Recent ones include the Persistence Scale-Space kernel, Persistence Images [1], the Persistence Weighted Gaussian kernel (PWGK), the Sliced Wasserstein kernel, and the Persistence Fisher kernel.
In these approaches, when computing the distance or kernel between persistence summaries, the importance (weight) of different persistence features is often pre-determined (e.g., either with uniform weights, or weighted by persistence). In persistence images [1] and PWGK, the importance of having a weight-function for the birth-death plane (containing the persistence points) has been emphasized and explicitly included in the formulation of the kernels. However, before using these kernels, the weight-function needs to be pre-set.
On the other hand, as recognized by , the choice of the weight-function should depend on the nature of the specific type of data one considers. For example, for the persistence diagrams computed from atomic configurations of molecules, features with small persistence could capture the local packing patterns which are of utmost importance and thus should be given a larger weight; while in many other scenarios, small persistence typically leads to noise with low importance. However, in general researchers performing data analysis tasks may not have such prior insights on input data. Thus it is natural and highly desirable to learn a best weight-function from labelled data.
Our work and contributions.
In this paper, we study the problem of learning an appropriate metric (kernel) for persistence-based summaries from labelled data, as well as applying the learnt kernel to the challenging graph classification task. Our contributions are two-fold.
- Metric learning for persistence summaries:
We propose a new weighted kernel (called WKPI) for persistence summaries, based on the persistence image representation. Our new WKPI kernel is positive semi-definite and its induced distance is stable. The weight-function used in this kernel directly encodes the importance of different locations in the persistence diagram. We next model the metric learning problem for persistence summaries as the problem of learning (the parameters of) this weight-function from a certain function class. In particular, we develop a cost function, and metric learning is then formulated as an optimization problem. Interestingly, we show that this cost function has a simple matrix view which helps both to clarify its meaning conceptually and to simplify the implementation of its optimization.
- Graph classification application:
Given a set of objects with class labels, we first learn a best WKPI-kernel as described above, and then use the learned WKPI to further classify objects. We implemented this WKPI-classification framework, and apply it to a range of graph data sets. Graph classification is an important problem, and there is a large literature on developing effective graph representations (e.g., [22, 36, 2, 27, 39, 42, 34]) and graph neural networks (e.g., [43, 35, 41, 40, 30]) for classifying graphs. The problem is challenging as graph data are less structured and more complex. We test our WKPI-classification framework on a range of benchmark graph data sets as well as new data sets (modeling neuron morphologies). We show that the learned WKPI is consistently much more effective than other persistence-based kernels. Most importantly, when compared with several existing state-of-the-art graph classification frameworks, our framework shows similar or (sometimes significantly) better performance in almost all cases than the best results by existing approaches. (Several datasets are attributed graphs, and our new framework achieves these results without even using those attributes.) Given the importance of the graph classification problem, we believe that this is an independent and important contribution of our work.
We note that  is the first to recognize the importance of using labelled data to learn a task-optimal representation of topological signatures. They developed an end-to-end deep neural network for this purpose, using a novel and elegant design of the input layer to implicitly learn a task-specific representation. We instead explicitly formulate the metric-learning problem for persistence-summaries, and decouple the metric-learning (which can also be viewed as representation-learning) component from the downstream data analysis tasks. Also as shown in Section 4, our WKPI-classification framework (using SVM) achieves better results on graph classification datasets.
2 Persistence-based framework
We first give an informal description of persistent homology below. See  for more detailed exposition on the subject.
Suppose we are given a shape $X$ (in our later graph classification application, $X$ is a graph). Imagine we inspect $X$ through a filtration of $X$, which is a sequence of growing subsets of $X$: $X_1 \subseteq X_2 \subseteq \cdots \subseteq X_n = X$. As we scan $X$, sometimes a new feature appears in $X_i$, and sometimes an existing feature disappears upon entering $X_j$. Using the topological objects called homology classes to describe these features (which intuitively capture components, independent loops / voids, and their higher-dimensional counterparts), the birth and death of topological features can then be captured by persistent homology, in the form of a persistence diagram $\mathrm{Dg}\,X$. Specifically, $\mathrm{Dg}\,X$ consists of a multi-set of points in the plane (which we call the birth-death plane $\mathbb{R}^2$), where each point $(b, d)$ in it, called a persistence-point, indicates that a certain homological feature is created upon entering $X_b$ and destroyed upon entering $X_d$. A common way to obtain a meaningful filtration of $X$ is via the sublevel-set filtration induced by a descriptor function $f$ on $X$. More specifically, given a function $f: X \to \mathbb{R}$, let $X_{\le a} := \{x \in X \mid f(x) \le a\}$ be its sublevel-set at $a$. Let $a_1 < a_2 < \cdots < a_n$ be $n$ real values. The sublevel-set filtration w.r.t. $f$ is $X_{\le a_1} \subseteq X_{\le a_2} \subseteq \cdots \subseteq X_{\le a_n}$, and its persistence diagram is denoted by $\mathrm{Dg}\,f$. Each persistence-point $p = (a_i, a_j) \in \mathrm{Dg}\,f$ indicates the function values at which some topological feature is created (when entering $X_{\le a_i}$) and destroyed (in $X_{\le a_j}$), and the persistence of this feature is its life-time $\mathrm{pers}(p) = |a_j - a_i|$. See Figure 2 (a) for a simple example. If one sweeps $X$ top-down in decreasing function values, one gets the persistence diagram induced by the super-levelset filtration of $X$ w.r.t. $f$ in an analogous way. Finally, if one tracks the change of topological features in the levelset $f^{-1}(a)$, one obtains the so-called levelset zigzag persistence.
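To make the sublevel-set filtration concrete, the following is a minimal sketch of 0-dimensional persistence for a vertex-valued function on a graph, computed with a union-find structure. It is an illustration under stated assumptions rather than the implementation used in this paper: the function name `sublevel_persistence_0d`, the convention that an edge enters the filtration at the larger of its two endpoint values, and the choice to drop zero-persistence pairs are all assumptions made here.

```python
def sublevel_persistence_0d(values, edges):
    """0-dimensional persistence pairs of a sublevel-set filtration.

    values: list of function values, one per vertex (vertex i has values[i]).
    edges:  list of (u, v) vertex-index pairs.
    Returns a sorted list of (birth, death) pairs. A component that never
    merges into an older one is reported with death = float("inf").
    """
    parent = list(range(len(values)))

    def find(x):
        # Path halving; the root of a component is always its oldest vertex.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Assumption: an edge appears once both of its endpoints have appeared.
    ordered = sorted(edges, key=lambda e: max(values[e[0]], values[e[1]]))
    pairs = []
    for u, v in ordered:
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # the edge closes a loop: no 0-dimensional event
        t = max(values[u], values[v])
        if values[ru] > values[rv]:
            ru, rv = rv, ru  # make ru the older (earlier-born) component
        if values[rv] < t:
            pairs.append((values[rv], t))  # the younger component dies at t
        parent[rv] = ru
    roots = {find(x) for x in range(len(values))}
    pairs.extend((values[r], float("inf")) for r in roots)
    return sorted(pairs)


# A path graph 0 - 1 - 2 - 3 with a local minimum at vertex 2.
diagram = sublevel_persistence_0d([0, 2, 1, 3], [(0, 1), (1, 2), (2, 3)])
```

In this example the component born at value 1 (vertex 2) merges into the older component born at value 0 when the filtration reaches value 2, so it contributes the persistence-point (1, 2).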
Persistent homology provides a generic yet powerful way to summarize a space $X$. Even when the space is complex, say a graph, we can still map it to a persistence diagram via appropriate descriptor functions. Furthermore, a different descriptor function $f$ provides a different perspective of $X$, and its persistence diagram $\mathrm{Dg}\,f$ summarizes features of $X$ at all scales w.r.t. this perspective.
If we are given a collection of shapes $\Xi$, we can compute a persistence diagram $\mathrm{Dg}\,X$ for each $X \in \Xi$, which maps the set $\Xi$ to a set of points in the space of persistence diagrams. There are natural distances defined for persistence diagrams, including the bottleneck distance and the Wasserstein distance, both of which have been well studied (e.g., stability under these distances [14, 15, 12]) with efficient implementations available [24, 25]. However, to facilitate downstream machine learning / data analysis tasks, it is desirable to further map the persistence diagrams to another representation (e.g., in a Hilbert space). Below we introduce one such representation, called the persistence image [1], as our new kernel is based on it.
Let $A$ be a persistence diagram (containing a multiset of persistence-points). Set $T: \mathbb{R}^2 \to \mathbb{R}^2$ to be the linear transformation where for each $(x, y) \in \mathbb{R}^2$, $T(x, y) = (x, y - x)$. Let $T(A)$ be the transformed diagram of $A$. Let $\phi_u: \mathbb{R}^2 \to \mathbb{R}$ be a differentiable probability distribution with mean $u \in \mathbb{R}^2$ (e.g., the normalized Gaussian where for any $z \in \mathbb{R}^2$, $\phi_u(z) = \frac{1}{2\pi\tau^2} e^{-\|z - u\|^2 / (2\tau^2)}$).
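Definition 2.1 ([1])
Let $\alpha: \mathbb{R}^2 \to \mathbb{R}$ be a non-negative weight-function for the persistence plane $\mathbb{R}^2$. Given a persistence diagram $A$, its persistence surface $\rho_A: \mathbb{R}^2 \to \mathbb{R}$ (w.r.t. $\alpha$) is defined as: for any $z \in \mathbb{R}^2$, $\rho_A(z) = \sum_{u \in T(A)} \alpha(u)\, \phi_u(z)$.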
See Figure 2 (b) for an example. Adams et al. further "discretize" the 2D persistence surface to map it to a finite vector. In particular, fix a grid on a rectangular region in the plane with a collection $P$ of $N$ rectangles (pixels).
Definition 2.2 ([1])
Given a persistence diagram $A$, its persistence image $PI_A = \{PI[p]\}_{p \in P}$ consists of $N$ numbers, one for each pixel $p$ in the grid $P$, with $PI[p] := \iint_p \rho_A \, dy\, dx$.
The persistence image can be viewed as a vector in $\mathbb{R}^N$. One can then compute the distance between two persistence diagrams $A_1$ and $A_2$ by the $L_2$-distance $\|PI_1 - PI_2\|_2$ between their persistence images (vectors) $PI_1$ and $PI_2$. Persistence images have several nice properties, including stability guarantees; see [1] for more details.
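The two definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation of persistence images: the function name `persistence_image`, the square grid over `[lo, hi] x [lo, hi]`, and the midpoint-rule approximation of each pixel integral (surface value at the pixel center times the pixel area) are assumptions made here for brevity.

```python
import numpy as np


def persistence_image(diagram, alpha, tau, lo, hi, res):
    """Approximate persistence image as a flat vector of res * res pixels.

    diagram: iterable of finite (birth, death) pairs.
    alpha:   non-negative weight alpha(x, y) on the transformed plane.
    tau:     standard deviation of the Gaussian placed at each point.
    lo, hi:  the grid covers the square [lo, hi] x [lo, hi] (an assumption).
    """
    step = (hi - lo) / res
    centers = lo + step * (np.arange(res) + 0.5)
    gx, gy = np.meshgrid(centers, centers, indexing="ij")
    surface = np.zeros((res, res))
    for birth, death in diagram:
        x, y = birth, death - birth  # linear transform T(x, y) = (x, y - x)
        sq = (gx - x) ** 2 + (gy - y) ** 2
        gauss = np.exp(-sq / (2 * tau**2)) / (2 * np.pi * tau**2)
        surface += alpha(x, y) * gauss
    # Midpoint rule: pixel integral ~ value at the pixel center * pixel area.
    return (surface * step * step).ravel()


# One persistence-point with constant weight 1 (an unweighted image).
img = persistence_image([(0.5, 1.0)], lambda x, y: 1.0, 0.05, 0.0, 1.0, 50)
```

With a constant weight of 1 and a grid that comfortably contains the Gaussian, the pixel values sum to approximately 1, since each point contributes one unit of probability mass.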
3 Metric learning frameworks
Suppose we are given a set of $n$ objects $\Xi$ (sampled from a hidden data space $S$), classified into $k$ classes. We want to use these labelled data to learn a good distance for (persistence-image representations of) objects from $\Xi$, which hopefully is more appropriate for classifying objects in the data space $S$. To do so, below we propose a new persistence-based kernel for persistence images, and then formulate an optimization problem to learn the best weight-function so as to obtain a good distance metric for $\Xi$ (and the data space $S$).
3.1 Weighted persistence image kernel (WKPI)
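From now on, we fix the grid $P$ (of size $N$) used to generate persistence images (so a persistence image is a vector in $\mathbb{R}^N$). Let $p_s$ denote the center of the $s$-th pixel in $P$, for $s \in \{1, 2, \ldots, N\}$. We now introduce a new kernel for persistence images. A weight-function refers to a non-negative real-valued function defined on $\mathbb{R}^2$.
Definition 3.1
Let $\omega: \mathbb{R}^2 \to \mathbb{R}$ be a weight-function. Given two persistence images $PI$ and $PI'$, the ($\omega$-)weighted persistence image kernel (WKPI) is defined as: $k_\omega(PI, PI') := \sum_{s=1}^{N} \omega(p_s)\, e^{-(PI(s) - PI'(s))^2 / (2\sigma^2)}$.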
Remark 1: We could use the persistence surfaces (instead of persistence images) to define the kernel (with the summation replaced by an integral). Since for computational purposes one still needs to approximate the integral in the kernel via some discretization, we choose to present our work using persistence images directly. Our Theorems 3.2 and 3.4 still hold (with a slightly different stability bound) if we use the kernel defined for persistence surfaces.
Remark 2: One can choose the weight-function from different function classes. Two popular choices are: mixtures of $m$ 2D Gaussians; and degree-$d$ polynomials in two variables.
Remark 3: There are other natural choices for defining a weighted kernel for persistence images. For example, we could use $k(PI, PI') = \sum_{s=1}^{N} e^{-\omega(p_s)(PI(s) - PI'(s))^2 / (2\sigma^2)}$, which we refer to as altWKPI. Alternatively, one could use the weight function of the persistence weighted Gaussian kernel (PWGK) directly. Indeed, we have implemented all these choices, and our experiments show that our WKPI kernel leads to better results than these choices for all datasets (see Appendix B.4). Furthermore, we note that the square of our WKPI-distance depends on $\omega$ linearly, which makes it much simpler to compute gradients of our cost function later via the chain rule. In addition, note that the PWGK kernel contains cross terms in its formulation, meaning that a quadratic number of terms (w.r.t. the number of persistence points) is needed to calculate the kernel. This makes it more expensive to compute and learn for complex objects (e.g., for the neuron data set, a single neuron tree could produce a persistence diagram with hundreds of persistence points).
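As an illustration of the first function class in Remark 2, a mixture of non-negatively weighted spherical 2D Gaussians can be sketched as below. The helper name `gaussian_mixture_weight` and the normalization of each component (squared distance divided by the squared width, with no leading constant) are assumptions made for this example; any fixed normalization can be absorbed into the multiplicative weights.

```python
import math


def gaussian_mixture_weight(centers, widths, coeffs):
    """Build a weight-function omega(x, y) from a Gaussian mixture.

    centers: list of (cx, cy) centers in the birth-death plane.
    widths:  list of positive widths, one per Gaussian.
    coeffs:  list of non-negative multiplicative weights, one per Gaussian.
    """
    if any(w < 0 for w in coeffs):
        raise ValueError("mixture weights must be non-negative")

    def omega(x, y):
        # Assumed (hypothetical) normalization: exp(-dist^2 / width^2).
        return sum(
            w * math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (s * s))
            for (cx, cy), s, w in zip(centers, widths, coeffs)
        )

    return omega


omega = gaussian_mixture_weight([(0.0, 0.0), (1.0, 1.0)], [1.0, 0.5], [2.0, 3.0])
pixel_centers = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
pixel_weights = [omega(x, y) for x, y in pixel_centers]
```

Evaluating `omega` at the pixel centers, as in the last line, yields the per-pixel weights that a WKPI-style kernel would use.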
The WKPI kernel is positive semi-definite.
By the above result, the WKPI kernel gives rise to a Hilbert space. We can now introduce the following WKPI-distance induced by the inner product on this Hilbert space.
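Given two persistence diagrams $A$ and $B$, let $PI_A$ and $PI_B$ be their corresponding persistence images. Given a weight-function $\omega$, the ($\omega$-weighted) WKPI-distance is defined as: $D_\omega(A, B) := \sqrt{k_\omega(PI_A, PI_A) + k_\omega(PI_B, PI_B) - 2\, k_\omega(PI_A, PI_B)}$.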
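The kernel and its induced distance can be sketched directly from their definitions. In this minimal example the persistence images are plain lists of pixel values and `pixel_weights[s]` stands for the weight-function evaluated at the center of pixel `s`; the function names and this list-based representation are illustrative assumptions.

```python
import math


def wkpi_kernel(pi_a, pi_b, pixel_weights, sigma):
    """Weighted sum over pixels of a Gaussian of the pixel-value difference."""
    return sum(
        w * math.exp(-((a - b) ** 2) / (2 * sigma**2))
        for a, b, w in zip(pi_a, pi_b, pixel_weights)
    )


def wkpi_distance(pi_a, pi_b, pixel_weights, sigma):
    """Distance induced by the kernel: sqrt(k(a,a) + k(b,b) - 2 k(a,b))."""
    sq = (
        wkpi_kernel(pi_a, pi_a, pixel_weights, sigma)
        + wkpi_kernel(pi_b, pi_b, pixel_weights, sigma)
        - 2 * wkpi_kernel(pi_a, pi_b, pixel_weights, sigma)
    )
    return math.sqrt(max(sq, 0.0))  # guard against tiny negative round-off


d = wkpi_distance([0.0, 0.0], [0.0, 1.0], [1.0, 2.0], sigma=1.0)
```

Since the kernel of an image with itself is just the sum of the pixel weights, the squared distance equals twice the weighted sum of `1 - exp(-(a - b)^2 / (2 sigma^2))` over pixels, which is linear in the weights.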
Stability of WKPI-distance.
Given two persistence diagrams $A$ and $B$, two traditional distances between them are the bottleneck distance $d_B(A, B)$ and the $p$-th Wasserstein distance $d_{W,p}(A, B)$. Stability of these two distances w.r.t. changes of input objects or functions defined on them has been studied [14, 15, 12]. Similar to the stability study on persistence images, below we prove that the WKPI-distance is stable w.r.t. small perturbations in persistence diagrams as measured by $d_{W,1}$. (Intuitively, view two persistence diagrams $A$ and $B$ as two (appropriate) measures; $d_{W,1}(A, B)$ is then the "earth-mover" distance between them, that is, the cost of converting the measure corresponding to $A$ to that for $B$.)
To simplify the presentation of Theorem 3.4, we use unweighted persistence images w.r.t. Gaussian, meaning that in Definition 2.1, (1) the weight function $\alpha$ is the constant function $\alpha = 1$; and (2) the distribution $\phi_u$ is the Gaussian $\phi_u(z) = \frac{1}{2\pi\tau^2} e^{-\|z - u\|^2 / (2\tau^2)}$. The proof of the following theorem can be found in Appendix A.2.
Remarks: We can obtain a more general bound for the case where the distribution is not Gaussian. Furthermore, we can obtain a similar bound when our WKPI-kernel and its induced WKPI-distance are defined using persistence surfaces instead of persistence images. We omit these from this short version of the paper.
3.2 Optimization problem for metric-learning
Suppose we are given a collection of objects $\Xi = \{X_1, \ldots, X_n\}$ (sampled from some hidden data space $S$), already classified (labeled) into $k$ classes $C_1, \ldots, C_k$. In what follows, we say that $i \in C_j$ if $X_i$ has class-label $j$. We first compute the persistence diagram $A_i$ for each object $X_i$. (The precise filtration we use to do so will depend on the specific type of objects. Later in Section 4, we will describe filtrations used for graph data.) Let $\{A_1, \ldots, A_n\}$ be the resulting set of persistence diagrams. Given a weight-function $\omega$, its induced WKPI-distance between $A_i$ and $A_j$ can also be thought of as a distance for the original objects $X_i$ and $X_j$; that is, we can set $D_\omega(X_i, X_j) := D_\omega(A_i, A_j)$. Our goal is to learn a good distance metric for the data space $S$ (where $\Xi$ is sampled from) from the labels. We will formulate this as learning a best weight-function $\omega^*$ so that its induced WKPI-distance best fits the class-labels of the $X_i$'s. Specifically, for any $t \in \{1, 2, \ldots, k\}$, set: $cost_\omega(t, t) = \sum_{i, j \in C_t} D_\omega^2(A_i, A_j)$ and $cost_\omega(t, \cdot) = \sum_{i \in C_t,\, j \in \{1, \ldots, n\}} D_\omega^2(A_i, A_j)$.
Intuitively, $cost_\omega(t, t)$ is the total in-class (square) distance for $C_t$, while $cost_\omega(t, \cdot)$ is the total (square) distance from objects in class $C_t$ to all objects in $\Xi$. A good metric should lead to relatively smaller distances between objects from the same class, but larger distances between objects from different classes. We thus propose the following optimization problem:
Definition 3.5 (Optimization problem)
Given a weight-function $\omega$, the total-cost of its induced WKPI-distance over $\Xi$ is defined as: $TC(\omega) := \sum_{t=1}^{k} \frac{cost_\omega(t, t)}{cost_\omega(t, \cdot)}$.
The optimal distance problem aims to find the best weight-function $\omega^*$ from a certain function class $\mathcal{F}$ so that the total-cost is minimized; that is: $\omega^* = \arg\min_{\omega \in \mathcal{F}} TC(\omega)$.
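As a small worked instance of the total-cost, the sketch below evaluates it with explicit loops from a precomputed matrix of squared distances. The function name `total_cost` and the toy distance matrix are illustrative assumptions; the ratio computed for each class is the in-class sum divided by the sum from that class to all objects.

```python
def total_cost(sq_dist, labels):
    """Sum over classes of (in-class squared distance) / (distance to all).

    sq_dist[i][j]: squared distance between objects i and j.
    labels[i]:     class label of object i.
    Assumes every class has a positive total distance to all objects.
    """
    n = len(labels)
    cost = 0.0
    for c in sorted(set(labels)):
        members = [i for i in range(n) if labels[i] == c]
        in_class = sum(sq_dist[i][j] for i in members for j in members)
        to_all = sum(sq_dist[i][j] for i in members for j in range(n))
        cost += in_class / to_all
    return cost


# Toy data: two classes of two objects; squared distance 1 within a class
# and 4 across classes.
sq = [[0, 1, 4, 4],
      [1, 0, 4, 4],
      [4, 4, 0, 1],
      [4, 4, 1, 0]]
tc = total_cost(sq, [0, 0, 1, 1])
```

Here each class has in-class sum 2 and sum-to-all 18, so the total-cost is 2/18 + 2/18 = 2/9. Shrinking the in-class distances relative to the cross-class ones lowers the value, which is the behaviour the optimization aims for.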
Matrix view of optimization problem.
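We observe that our cost function can be re-formulated into a matrix form. This provides us with a perspective from the Laplacian matrix of certain graphs to understand the cost function, and helps to simplify the implementation of our optimization problem, as several programming languages popular in machine learning (e.g., Python and Matlab) handle matrix operations more efficiently than explicit loops.
More precisely, recall our input is a set $\Xi$ of $n$ objects with labels from $k$ classes. We set up the following matrices: $\Lambda = [\Lambda_{ij}]_{n \times n}$ with $\Lambda_{ij} = D_\omega^2(A_i, A_j)$; $G = \mathrm{diag}(g_1, \ldots, g_n)$ with $g_i = \sum_{j=1}^{n} \Lambda_{ij}$; $L = G - \Lambda$; and $H = [h_{ti}]_{k \times n}$ with $h_{ti} = 1/\sqrt{cost_\omega(t, \cdot)}$ if $i \in C_t$ and $h_{ti} = 0$ otherwise.
If we view $\Lambda$ as a distance matrix of the objects $\{X_1, \ldots, X_n\}$, $L$ is then its Laplacian matrix. The technical proof of the following main theorem can be found in Appendix A.3.
Theorem 3.6
The total-cost can also be represented by $TC(\omega) = k - \mathrm{Tr}(H L H^T)$, where $\mathrm{Tr}(\cdot)$ is the trace of a matrix. Furthermore, $H G H^T = I$, where $I$ is the $k \times k$ identity matrix.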
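Note that all matrices, $L$, $G$, and $H$, are dependent on the (parameters of the) weight-function $\omega$, and in the following corollary of Theorem 3.6, we use the subscript $\omega$ to emphasize this dependence.
Corollary 3.7
The optimal distance problem is equivalent to $\min_\omega \left(k - \mathrm{Tr}(H_\omega L_\omega H_\omega^T)\right)$, subject to $H_\omega G_\omega H_\omega^T = I$.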
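The matrix view can be checked numerically on a toy example. The sketch below assumes the following setup, which is a reconstruction and should be read as such: `lam` holds the squared pairwise distances, `g` its row sums, the Laplacian is `diag(g) - lam`, and `h` is a k-by-n matrix whose row for class t has entries 1 / sqrt(total distance from class t to all objects) on the members of class t and zero elsewhere. The function name `total_cost_matrix_view` is hypothetical.

```python
import numpy as np


def total_cost_matrix_view(sq_dist, labels):
    """Return (k - Tr(H L H^T), H G H^T) for the assumed matrices."""
    lam = np.asarray(sq_dist, dtype=float)     # squared-distance matrix
    labels = np.asarray(labels)
    classes = np.unique(labels)
    g = lam.sum(axis=1)                        # row sums g_i
    lap = np.diag(g) - lam                     # Laplacian L = G - Lambda
    h = np.zeros((len(classes), len(labels)))  # H is k x n
    for t, c in enumerate(classes):
        members = labels == c
        # Sum of g_i over a class is its total distance to all objects.
        h[t, members] = 1.0 / np.sqrt(g[members].sum())
    total_cost = len(classes) - np.trace(h @ lap @ h.T)
    constraint = h @ np.diag(g) @ h.T
    return total_cost, constraint


# Same toy data as a loop-based evaluation would use: squared distance 1
# within a class and 4 across classes.
sq = [[0, 1, 4, 4],
      [1, 0, 4, 4],
      [4, 4, 0, 1],
      [4, 4, 1, 0]]
tc_matrix, constraint = total_cost_matrix_view(sq, [0, 0, 1, 1])
```

For this input the trace form gives 2/9, the same value obtained by summing the per-class ratios directly, and the constraint matrix is the 2-by-2 identity because the classes have disjoint supports and `diag(g)` is diagonal.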
Solving the optimization problem.
In our implementation, we use (stochastic) gradient descent to find a (locally) optimal weight-function for the minimization problem. We use the matrix view as given in Corollary 3.7, minimizing $k - \mathrm{Tr}(H L H^T)$ subject to $H G H^T = I$. We briefly describe our procedure, where we assume that the weight-function is from the class of mixtures of $m$ 2D non-negatively weighted (spherical) Gaussians. Each weight-function is thus determined by $4m$ parameters: for each Gaussian, its 2D center, its width, and its multiplicative weight.
From the proof of Theorem 3.6 (in the appendix), it turns out that the condition $H G H^T = I$ is satisfied as long as the multiplicative weight of each Gaussian in the mixture is non-negative. Hence during gradient descent, we only need to make sure that this holds. (In our implementation, we add a penalty term to the total-cost to achieve this in a "soft" manner.) It is easy to write out the gradient of $TC(\omega)$ w.r.t. each parameter in matrix form. For example, writing $TC = k - \mathrm{Tr}(H L H^T)$, for any parameter $\theta$ of $\omega$ the product rule and the symmetry of $L$ give $\frac{\partial TC}{\partial \theta} = -\mathrm{Tr}\left(H \frac{\partial L}{\partial \theta} H^T\right) - 2\,\mathrm{Tr}\left(\frac{\partial H}{\partial \theta} L H^T\right)$.
While this does not improve the asymptotic complexity of computing the gradient (compared to using the formulation of the cost function in Definition 3.5), programming languages such as Python and Matlab can implement these matrix operations much more efficiently than loops. For large data sets, one can use stochastic gradient descent, by sampling a subset of the input persistence images and computing the matrices as well as the cost using the subsampled data points. In our implementation, we use the Armijo-Goldstein line search scheme to update the parameters in each (stochastic) gradient descent step. The optimization procedure terminates when the cost function converges or the number of iterations exceeds a threshold.
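A single gradient step with a line search can be sketched as follows. This is generic backtracking with the Armijo sufficient-decrease condition, shown on a toy quadratic; it is not the authors' code, and the function name `armijo_step`, the constants `shrink` and `c`, and the starting step size are assumptions chosen for illustration.

```python
def armijo_step(f, grad, x, step=1.0, shrink=0.5, c=1e-4, max_halvings=50):
    """One gradient-descent step with backtracking (Armijo) line search.

    f:    objective, maps a list of floats to a float.
    grad: gradient of f, maps a list of floats to a list of floats.
    Returns the new point, or x unchanged if no acceptable step is found.
    """
    fx = f(x)
    g = grad(x)
    g_norm_sq = sum(gi * gi for gi in g)
    for _ in range(max_halvings):
        candidate = [xi - step * gi for xi, gi in zip(x, g)]
        # Accept once the decrease is at least c * step * ||grad||^2.
        if f(candidate) <= fx - c * step * g_norm_sq:
            return candidate
        step *= shrink
    return x


def quadratic(x):
    return sum(xi * xi for xi in x)


def quadratic_grad(x):
    return [2.0 * xi for xi in x]


x0 = [1.0, -2.0]
x1 = armijo_step(quadratic, quadratic_grad, x0)
```

On this quadratic the full step overshoots to the mirror point and is rejected, while the halved step lands on the minimizer. In a metric-learning setting `f` would be the (penalized) total-cost as a function of the weight-function parameters, evaluated on the full data or on a sampled mini-batch.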
4 Experiments
In this section, we show the effectiveness of our metric-learning framework and the usefulness of the learned metric via graph classification applications. In particular, given a set of graphs $\Xi = \{G_1, \ldots, G_n\}$ coming from $k$ classes, we first compute the unweighted persistence image $PI_i$ for each graph $G_i$, and apply the framework from Section 3 to learn the "best" weight-function $\omega^*$ on the birth-death plane from these persistence images. Next, we perform graph classification using kernel-SVM with the learned $\omega^*$-WKPI kernel. We refer to this framework as the WKPI-classification framework.
We show two families of experiments below. In Section 4.1 we show that our learned WKPI kernel significantly outperforms existing persistence-based representations. In Section 4.2, we compare the performance of the WKPI-classification framework with various state-of-the-art methods for the graph classification task over a range of data sets.
Setup for our WKPI-based framework.
In all our experiments, we assume that the weight-function comes from the class of mixtures of $m$ 2D non-negatively weighted Gaussians as described at the end of Section 3.2. Furthermore, all Gaussians are isotropic with the same standard deviation (width) $\sigma$. We take $m$ and this width $\sigma$ as hyperparameters: we search over a range of candidate values for each and determine their final choices via 10 times 10-fold cross validation. That is, we repeat the following process 10 times: split each dataset into 10 folds and perform 10-fold cross validation, where in each of the 10 rounds 9 folds are used for training and 1 for testing.
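The "10 times 10-fold" protocol can be sketched with the standard library alone. The generator below yields index splits only; the name `repeated_kfold`, the fixed seed, and the strided assignment of shuffled indices to folds are assumptions made for this example rather than details taken from the experiments.

```python
import random


def repeated_kfold(n, folds=10, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold validation.

    Each repeat reshuffles the n indices and splits them into `folds`
    nearly equal parts; every part serves as the test set exactly once.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        for f in range(folds):
            test = order[f::folds]          # every folds-th shuffled index
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test


splits = list(repeated_kfold(25))
```

With the defaults this produces 100 train/test splits. A hyperparameter setting (for instance a number of Gaussians and a width) would be scored by averaging the test accuracy over all of them.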
One important question is how to initialize the centers of the Gaussians in our mixture. There are three strategies that we consider. (1) We simply sample $m$ centers in the domain of persistence images randomly. (2) We collect all points in the persistence diagrams derived from the training data, and run a k-means algorithm to identify $m$ means. (3) We run a k-center algorithm on those points to identify $m$ centers. Strategies (2) and (3) usually outperform strategy (1). Thus in what follows we only report results from using k-means and k-centers as initialization, referred to as WKPI-kM and WKPI-kC, respectively.
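Strategy (3) can be illustrated with the classical greedy farthest-point heuristic for the k-center problem. The text does not state which k-center algorithm is used, so this particular heuristic, the function name `k_center`, and the choice of the first center are assumptions made for the sketch.

```python
import math


def k_center(points, k, first=0):
    """Greedy farthest-point selection of k centers from a list of points.

    points: list of coordinate tuples (e.g. pooled persistence-points).
    first:  index of the point used as the first center (an assumption).
    """
    chosen = [first]
    # dist[i] = distance from point i to its nearest chosen center so far.
    dist = [math.dist(p, points[first]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return [points[i] for i in chosen]


pooled = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
centers = k_center(pooled, 2)
```

Starting from the first point, the second center is the point farthest from it, so the two selected centers land in the two well-separated groups of the toy input. The returned points would serve as initial Gaussian centers before gradient descent refines them.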
Table 1: Classification accuracy on the neuron datasets. Column groups: Datasets; Existing approaches; Alternative metric learning; Our WKPI framework.
4.1 Comparison with other persistence-based methods
We compare our methods with state-of-the-art persistence-based representations, including the Persistence Weighted Gaussian Kernel (PWGK), the original Persistence Image (PI) [1], and the Sliced Wasserstein (SW) Kernel. Furthermore, as mentioned in Remark 3 after Definition 3.1, we can learn weight functions in PWGK by optimizing the same cost function (via replacing our WKPI-distance with the one computed from the PWGK kernel); we refer to this as trainPWGK. We can also use an alternative kernel for persistence images as described in Remark 3, and then optimize the same cost function using the distance computed from this kernel; we refer to this as altWKPI. We will compare our methods both with existing approaches and with these two alternative metric-learning approaches (trainPWGK and altWKPI).
Neuron cells have a natural tree morphology, rooted at the cell body (soma), with dendrites and axon branching out, and it is common in the field of neuroscience to model a neuron as a (geometric) tree. See Figure 4 in the appendix for an example. Our Neuron-Binary dataset consists of 1126 neuron trees classified into two (primary) classes: interneurons and principal neurons (data partly from the Blue Brain Project and downloaded from http://neuromorpho.org/). The second dataset, Neuron-Multi, is a refinement of the 459 neurons from the interneuron class into four (secondary) classes; hence neurons in dataset Neuron-Multi all come from one class of dataset Neuron-Binary.
Generation of persistence.
Given a neuron tree $T$, following prior work, we use the descriptor function $f: T \to \mathbb{R}$, where $f(x)$ is the geodesic distance from $x$ to the root of $T$ along the tree. In Neuron-Multi, to differentiate the dendrite and axon parts of a neuron cell, we further negate the function value if a point $x$ is in the dendrite. We then use the union of persistence diagrams induced by both the sublevel-set and superlevel-set filtrations w.r.t. $f$. Under these filtrations, intuitively, each point in the birth-death plane corresponds to the creation and death of a certain branch feature of the input neuron tree. The set of persistence diagrams obtained this way (one for each neuron tree) is the input to our WKPI-classification framework.
Results on neuron datasets.
The classification accuracy of various methods is given in Table 1. To obtain these results, we split the training cases and testing cases with the ratio 1:1 for both datasets. As the number of trees is not large, we use all training data to compute the gradients of the cost function in the optimization process instead of mini-batch sampling. Our optimization procedure terminates when the change of the cost function remains below a small threshold or the iteration number exceeds 2000. Persistence images are needed both for the methodology of [1] and as input for our WKPI-distance, and their resolution is fixed (see Appendix B.2 for details). For the persistence image (PI) approach of [1], we experimented both with the unweighted persistence images (PI-CONST) and with a variant, denoted by PI-PL, where the weight function is a simple piecewise-linear (PL) function adapted from what is proposed in [1]; see Appendix B.2 for details. Since PI-PL performs better than PI-CONST on both datasets, Table 1 only shows the results of PI-PL.
Note that while our WKPI-framework is based on persistence images (PI), our classification accuracy is much better. We also point out that in our results here as well as later for graph classification task, our method consistently outperforms all other persistence-based representations, often by a large margin; see Appendix B.4 for comparison of our methods with these existing persistence-based frameworks on graph classification.
In Figure 3 we show the heatmap of the learned weight-function for both datasets. Interestingly, we note that the important branching features (points in the birth-death plane with high weight values) separating the two primary classes (i.e., for the Neuron-Binary dataset) are different from those important for classifying neurons from one of the two primary classes (the interneuron class) into the four secondary classes (i.e., the Neuron-Multi dataset). Also, points with high importance (weight) do not necessarily have high persistence. In the future, it would be interesting to investigate whether the important branch features are also biochemically important.
4.2 Graph classification task
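Table 2: Graph classification accuracy on the benchmark datasets. Column groups: Dataset; Previous approaches; Our approaches.
Table 3: Graph classification accuracy together with its variance. Column groups: Dataset; Previous approaches; Our approaches.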
Benchmark datasets and comparison methods.
We use a range of benchmark datasets (collected from recent literature) including: (1) several datasets on graphs derived from small chemical compounds or protein molecules: NCI1 and NCI109, PTC, PROTEIN, DD and MUTAG; (2) two datasets on graphs representing the response relations between users in Reddit: REDDIT-5K (5 classes) and REDDIT-12K (11 classes); and (3) two datasets on IMDB networks of actors/actresses: IMDB-BINARY (2 classes) and IMDB-MULTI (3 classes). See Appendix B.3 for descriptions of these datasets and their statistics (sizes of graphs etc.).
Many graph classification methods have been proposed in the literature, with different methods performing better on different datasets. Thus we include a large number of approaches to compare with: six graph-kernel based approaches: RetGK, FGSD, Weisfeiler-Lehman kernel (WL), Weisfeiler-Lehman optimal assignment kernel (WL-OA), Graphlet kernel (GK), and Deep Graphlet kernel (DGK); as well as three graph neural networks: PATCHYSAN (PSCN), Graph Isomorphism Network (GIN) and the deep learning framework with topological signature (DL-TDA).
To generate persistence diagram summaries, we want to put a meaningful descriptor function on input graphs. We consider two choices in our experiments: (a) the Ricci-curvature function $f_c: G \to \mathbb{R}$, where $f_c(x)$ is a discrete Ricci curvature for graphs; and (b) the Jaccard-index function $f_J: G \to \mathbb{R}$. In particular, the Jaccard-index of an edge $(u, v)$ in the graph $G$ is defined as $\rho(u, v) = \frac{|NN(u) \cap NN(v)|}{|NN(u) \cup NN(v)|}$, where $NN(x)$ refers to the set of neighbors of node $x$ in $G$. The Jaccard index has been commonly used as a way to measure edge-similarity. (We modify our persistence algorithm slightly to handle the edge-valued Jaccard index function.) As in the case of the neuron data sets, we take the union of the 0-th persistence diagrams induced by both the sublevel-set and the superlevel-set filtrations of the descriptor function $f$, and convert it to a persistence image as input to our WKPI-classification framework. (We expect that using the 0-th zigzag persistence diagrams will provide better results. However, we choose to use only the 0-th standard persistence as it can be easily implemented to run in $O(n \log n)$ time using a simple union-find data structure.) In the results reported in Table 2, the Ricci curvature function is used for the small chemical compound data sets (NCI1, NCI109, PTC and MUTAG), while the Jaccard function is used for the two protein datasets (PROTEIN and DD) as well as the social/IMDB networks (IMDB's and REDDIT's).
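The Jaccard-index descriptor on edges is simple to compute from adjacency sets. The sketch below assumes open neighborhoods (a node is not counted as its own neighbor) and returns 0 when both neighborhoods are empty; the function name `edge_jaccard` and the toy graph are illustrative.

```python
def edge_jaccard(adj, u, v):
    """Jaccard index of edge (u, v): |N(u) & N(v)| / |N(u) | N(v)|.

    adj: dict mapping each node to the set of its neighbors
         (open neighborhoods, an assumption of this sketch).
    """
    union = adj[u] | adj[v]
    if not union:
        return 0.0
    return len(adj[u] & adj[v]) / len(union)


# A triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
edge_values = {
    (u, v): edge_jaccard(adj, u, v)
    for u in adj for v in adj[u] if u < v
}
```

Edges inside the triangle share a common neighbor and so get a positive value, while the pendant edge (2, 3) shares none and gets 0. These edge values would then drive the sublevel-set and superlevel-set filtrations.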
The graph classification results of various methods are reported in Tables 2 and 3. In particular, Table 2 compares the classification accuracy across a range of methods, while Table 3 also lists the variance of the prediction accuracy (via 10 times 10-fold cross validation). Note that not all of the previous accuracies / variances in the literature are computed under the same 10 times 10-fold cross validation setup as ours. For instance, the results reported for RetGK are computed from only one 10-fold cross validation. The setup for our method is the same as for the neuron data; the only difference is that if the input dataset has more than 1000 graphs, then we use mini-batches to compute the gradient in each iteration. Results of other methods are taken from their respective papers. The results of DL-TDA are not listed in the table, as only the classification accuracies for REDDIT-5K and REDDIT-12K are given in their paper (which contains more results on images as well). The comparison with other persistence-based methods can be found in Appendix B.4 (where we consistently perform the best), and we only include one of them, SW, in this table, as it performs the best among existing persistence-based approaches.
The last two columns in Table 2 and Table 3 are our results, where WKPI-kM stands for WKPI-kmeans and WKPI-kC for WKPI-kcenter. Except for MUTAG and IMDB-MULTI, the performance of our WKPI-framework is the same as or better than the best of the other methods. It is important to observe that our WKPI-framework performs well on both chemical graphs and social graphs, while some of the earlier work tends to work well on only one type of graph. Furthermore, note that the chemical / molecular graphs usually have attributes associated with them. Some existing methods use these attributes in their classification [43, 35, 44]. Our results, however, are obtained purely from graph structure without using any attributes. Finally, in terms of variance (see Table 3), the variances of our methods tend to be on par with those of previous graph-kernel based approaches, and these variances are usually much better than those of the GNN based approaches (i.e., PSCN and GIN).
5 Concluding remarks
This paper introduces a new weighted-kernel for persistence images (WKPI), together with a metric-learning framework to learn the best weight-function for the WKPI-kernel from labelled data. Various properties of the kernel and the formulation of the optimization problem are provided. Importantly, we apply the learned WKPI-kernel to the task of graph classification, and show that our new framework achieves similar or better results than the best results among a range of previous graph classification approaches.
In our current framework, only a single descriptor function of each input object (e.g., a graph) is used to derive a persistence-based representation. It would be interesting to extend our framework to leverage multiple descriptor functions (so as to capture different types of information) simultaneously and effectively. Recent work on multidimensional persistence would be useful in this effort. Another important question is how to incorporate categorical attributes associated with graph nodes (or points in input objects) effectively. Indeed, real-valued attributes can potentially be used as a descriptor function to generate persistence-based summaries. But the handling of categorical attributes via topological summarization is much more challenging, especially when there is no (prior-known) correlation between these attributes (e.g., the attribute is simply a number from , coming from categories. The indices of these categories may carry no meaning).
The authors would like to thank Chao Chen and Justin Eldridge for useful discussions related to this project. We would also like to thank Giorgio Ascoli for helping provide the neuron dataset.
[1] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier. Persistence images: a stable vector representation of persistent homology. Journal of Machine Learning Research, 18:218–252, 2017.
[2] L. Bai, L. Rossi, A. Torsello, and E. R. Hancock. A quantum Jensen-Shannon graph kernel for unattributed graphs. Pattern Recognition, 48(2):344–355, 2015.
[3] U. Bauer. Ripser, 2016.
[4] U. Bauer, M. Kerber, J. Reininghaus, and H. Wagner. Phat – persistent homology algorithms toolbox. In H. Hong and C. Yap, editors, Mathematical Software – ICMS 2014, pages 137–143, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.
[5] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
[6] P. Bubenik. Statistical topological data analysis using persistence landscapes. The Journal of Machine Learning Research, 16(1):77–102, 2015.
[7] G. Carlsson and V. de Silva. Zigzag persistence. Foundations of Computational Mathematics, 10(4):367–405, 2010.
[8] G. Carlsson, V. de Silva, and D. Morozov. Zigzag persistent homology and real-valued functions. In Proc. 25th Annu. ACM Sympos. Comput. Geom., pages 247–256, 2009.
[9] G. Carlsson and A. Zomorodian. The theory of multidimensional persistence. Discrete & Computational Geometry, 42(1):71–93, 2009.
[10] M. Carrière, M. Cuturi, and S. Oudot. Sliced Wasserstein kernel for persistence diagrams. International Conference on Machine Learning, pages 664–673, 2017.
[11] F. Chazal, D. Cohen-Steiner, M. Glisse, L. J. Guibas, and S. Oudot. Proximity of persistence modules and their diagrams. In Proc. 25th ACM Sympos. on Comput. Geom., pages 237–246, 2009.
[12] F. Chazal, V. de Silva, M. Glisse, and S. Oudot. The structure and stability of persistence modules. SpringerBriefs in Mathematics. Springer, 2016.
[13] C. Maria, J.-D. Boissonnat, M. Glisse, and M. Yvinec. The gudhi library: simplicial complexes and persistent homology, 2014. url: http://gudhi.gforge.inria.fr/python/latest/index.html.
[14] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.
[15] D. Cohen-Steiner, H. Edelsbrunner, J. Harer, and Y. Mileyko. Lipschitz functions have Lp-stable persistence. Foundations of Computational Mathematics, 10(2):127–139, 2010.
[16] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry, 34(2):786–797, 1991.
[17] T. K. Dey, D. Shi, and Y. Wang. Simba: An efficient tool for approximating Rips-filtration persistence via simplicial batch-collapse. In 24th Annual European Symposium on Algorithms (ESA 2016), volume 57 of Leibniz International Proceedings in Informatics (LIPIcs), pages 35:1–35:16, 2016.
[18] P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003.
[19] H. Edelsbrunner and J. Harer. Computational Topology: an Introduction. American Mathematical Society, 2010.
[20] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. Discrete Comput. Geom., 28:511–533, 2002.
[21] C. Helma, R. D. King, S. Kramer, and A. Srinivasan. The predictive toxicology challenge 2000–2001. Bioinformatics, 17(1):107–108, 2001.
[22] S. Hido and H. Kashima. A linear-time graph kernel. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on, pages 179–188. IEEE, 2009.
[23] C. Hofer, R. Kwitt, M. Niethammer, and A. Uhl. Deep learning with topological signatures. In Advances in Neural Information Processing Systems, pages 1634–1644, 2017.
[24] M. Kerber, D. Morozov, and A. Nigmetov. Geometry helps to compare persistence diagrams. J. Exp. Algorithmics, 22:1.4:1–1.4:20, Sept. 2017.
[25] M. Kerber, D. Morozov, and A. Nigmetov. HERA: software to compute distances for persistence diagrams, 2018. URL: https://bitbucket.org/greynarn/hera.
[26] M. Kerber and H. Schreiber. Barcodes of towers and a streaming algorithm for persistent homology. In 33rd International Symposium on Computational Geometry (SoCG 2017), page 57. Schloss Dagstuhl-Leibniz-Zentrum für Informatik GmbH, 2017.
[27] N. M. Kriege, P. L. Giscard, and R. C. Wilson. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pages 1623–1631, 2016.
[28] G. Kusano, K. Fukumizu, and Y. Hiraoka. Kernel method for persistence diagrams via kernel embedding and weight factor. Journal of Machine Learning Research, 18(189):1–41, 2018.
[29] T. Le and M. Yamada. Persistence Fisher kernel: A Riemannian manifold kernel for persistence diagrams. In Advances in Neural Information Processing Systems (NIPS), pages 10028–10039, 2018.
[30] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. IEEE Trans. Signal Processing, 67(1):97–109, 2019.
[31] Y. Li, D. Wang, G. A. Ascoli, P. Mitra, and Y. Wang. Metrics for comparing neuronal tree shapes based on persistent homology. PloS one, 12(8):e0182184, 2017.
[32] Y. Lin, L. Lu, and S.-T. Yau. Ricci curvature of graphs. Tohoku Mathematical Journal, Second Series, 63(4):605–627, 2011.
[33] H. Markram, E. Muller, S. Ramaswamy, M. W. Reimann, M. Abdellah, C. A. Sanchez, A. Ailamaki, L. Alonso-Nanclares, N. Antille, S. Arsever, et al. Reconstruction and simulation of neocortical microcircuitry. Cell, 163(2):456–492, 2015.
[34] M. Neumann, N. Patricia, R. Garnett, and K. Kersting. Efficient graph kernels by randomization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 378–393. Springer, 2012.
[35] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. International Conference on Machine Learning, pages 2014–2023, 2016.
[36] N. Shervashidze, S. V. N. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. Artificial Intelligence and Statistics, pages 488–495, 2009.
[37] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt. A stable multi-scale kernel for topological machine learning. In Computer Vision Pattern Recognition, pages 4741–4748, 2015.
[38] D. Sheehy. Linear-size approximations to the Vietoris-Rips filtration. In Proc. 28th. Annu. Sympos. Comput. Geom., pages 239–248, 2012.
[39] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.
[40] S. Verma and Z.-L. Zhang. Hunt for the unique, stable, sparse and fast feature learning on graphs. Advances in Neural Information Processing Systems, pages 88–98, 2017.
[41] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
[42] L. Xu, X. Jin, X. Wang, and B. Luo. A mixed Weisfeiler-Lehman graph kernel. In International Workshop on Graph-based Representations in Pattern Recognition, pages 242–251, 2015.
[43] P. Yanardag and S. Vishwanathan. Deep graph kernels. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374, 2015.
[44] Z. Zhang, M. Wang, Y. Xiang, Y. Huang, and A. Nehorai. Retgk: Graph kernels based on return probabilities of random walks. In Advances in Neural Information Processing Systems, pages 3968–3978, 2018.
Appendix A Missing details from Section 3
A.1 Proof of Theorem 3.2
Consider an arbitrary collection of persistence images (i.e., a collection of vectors in ). Set to be the kernel matrix where . Now given any vector , we have that:
Because the Gaussian kernel is positive semi-definite and the weight-function is non-negative, for any . Hence the WKPI kernel is positive semi-definite.
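The computation behind this argument can be sketched as follows. The sketch assumes (the definition is not restated in this appendix) that the WKPI kernel of Definition 3.1 is a weighted sum of per-pixel Gaussian kernels, $k_w(PI, PI') = \sum_{s=1}^{N} \omega(p_s)\, e^{-(PI(s)-PI'(s))^2/(2\sigma^2)}$, where $p_s$ denotes the center of the $s$-th pixel. The symbols $n$, $N$, $K$, $v$, and $\sigma$ are notation chosen for this illustration.

```latex
\begin{align*}
v^{T} K v
  &= \sum_{i,j=1}^{n} v_i v_j \, k_w(PI_i, PI_j) \\
  &= \sum_{i,j=1}^{n} v_i v_j \sum_{s=1}^{N} \omega(p_s)\,
     e^{-\frac{(PI_i(s)-PI_j(s))^2}{2\sigma^2}} \\
  &= \sum_{s=1}^{N} \omega(p_s) \sum_{i,j=1}^{n} v_i v_j \,
     e^{-\frac{(PI_i(s)-PI_j(s))^2}{2\sigma^2}} \;\ge\; 0 .
\end{align*}
```

For each fixed pixel $s$, the inner double sum is the quadratic form of a Gaussian kernel matrix on the scalars $PI_1(s), \dots, PI_n(s)$ and is therefore non-negative; multiplying by $\omega(p_s) \ge 0$ and summing over $s$ preserves non-negativity.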
A.2 Proof of Theorem 3.4
By Definitions 3.1 and 3.3, combined with the fact that for any , we have that:
Furthermore, by Theorem 10 of [1], when the distribution in Definition 2.1 is the normalized Gaussian , and the weight function , we have that . (Intuitively, view two persistence diagrams and as two (appropriate) measures, and is then the “earth-mover” distance between them so as to convert the measure corresponding to to that for , where the cost is measured by the total -distance that all mass has to travel.) Combining this with the inequalities for above, the theorem then follows.
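One way the first chain of inequalities can go is sketched below. It is an illustration under stated assumptions rather than the paper's exact derivation: it assumes the kernel form $k_w(PI, PI') = \sum_{s=1}^{N} \omega(p_s)\, e^{-(PI(s)-PI'(s))^2/(2\sigma^2)}$ and the induced squared distance $D_w^2 = k_w(PI, PI) + k_w(PI', PI') - 2 k_w(PI, PI')$, and it uses the elementary fact $1 - e^{-x} \le x$ for every real $x$.

```latex
\begin{align*}
D_w^2
  &= 2 \sum_{s=1}^{N} \omega(p_s)
     \left( 1 - e^{-\frac{(PI(s)-PI'(s))^2}{2\sigma^2}} \right) \\
  &\le \frac{1}{\sigma^2} \sum_{s=1}^{N} \omega(p_s)\, (PI(s)-PI'(s))^2 \\
  &\le \frac{\|\omega\|_\infty}{\sigma^2}\, \| PI - PI' \|_2^2
   \;\le\; \frac{\|\omega\|_\infty}{\sigma^2}\, \| PI - PI' \|_1^2 .
\end{align*}
```

A bound of this shape reduces the stability of the WKPI-distance to the stability of persistence images themselves, which is where the cited stability result enters.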
A.3 Proof of Theorem 3.6
We first show the following properties of matrix which will be useful for the proof later.
Lemma A.1. The matrix L is symmetric and positive semi-definite. Furthermore, for every vector , we have
Proof. By construction, it is easy to see that is symmetric as matrices and are. The positive semi-definiteness follows from Eqn (2) which we prove now.
The lemma then follows. ∎
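The identity behind the lemma can be checked numerically. The sketch below assumes that L has the usual graph-Laplacian form, namely L = G − W where W is a symmetric matrix with non-negative entries and G is the diagonal matrix of its row sums; this form and all function names are assumptions made for illustration, since the matrices themselves are defined in Section 3. Under that assumption, the quadratic form of L equals half the weighted sum of squared coordinate differences, which is non-negative.

```python
import random


def laplacian(weights):
    """Return L = G - W, where G is the diagonal matrix of row sums of W."""
    n = len(weights)
    return [[(sum(weights[i]) if i == j else 0.0) - weights[i][j]
             for j in range(n)] for i in range(n)]


def quadratic_form(mat, x):
    """Compute x^T mat x."""
    n = len(x)
    return sum(x[i] * mat[i][j] * x[j] for i in range(n) for j in range(n))


def pairwise_form(weights, x):
    """Compute (1/2) * sum_{i,j} W[i][j] * (x_i - x_j)^2."""
    n = len(x)
    return 0.5 * sum(weights[i][j] * (x[i] - x[j]) ** 2
                     for i in range(n) for j in range(n))


def random_symmetric_weights(n, rng):
    """Random symmetric matrix with non-negative entries and zero diagonal."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = rng.random()
    return w
```

Comparing `quadratic_form(laplacian(w), x)` with `pairwise_form(w, x)` for random inputs agrees up to floating-point error, which mirrors the argument that positive semi-definiteness follows from the sum-of-squares expression.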
We now prove the statement in Theorem 3.6. Recall the definitions of the various matrices, and that ’s are the row vectors of matrix . For simplicity, in the derivations below, we use to denote the -induced WKPI-distance between persistence diagrams and . Applying Lemma A.1, we have:
Now by definition of , it is non-zero only when . Combined with Eqn (3), it then follows that:
This proves the first statement in Theorem 3.6. We now show that the matrix is the identity matrix . Specifically, first consider ; we claim:
It equals because is non-zero only for , while is non-zero only for . However, for such a pair of and , obviously , which means that . Hence the sum is for all possible and ’s.
Now for the diagonal entries of the matrix , we have that for any :
This finishes the proof that , and completes the proof of Theorem 3.6.
Appendix B More details for Experiments
B.1 Description of Neuron datasets
Neuron cells have a natural tree morphology (see Figure 4 (a) for an example), rooted at the cell body (soma), with dendrites and axons branching out. Furthermore, this tree morphology is important in understanding neurons. Hence it is common in the field of neuroscience to model a neuron as a (geometric) tree (see Figure 4 (b) for an example downloaded from NeuroMorpho.Org).
Our NeuBin dataset consists of 1126 neuron trees classified into two (primary) classes: interneurons and principal neurons (data partly from the Blue Brain Project [33] and downloaded from http://neuromorpho.org/). The second NeuMulti dataset is a refinement of the 459 neurons in the interneuron class into four (secondary) classes: basket-large, basket-nest, neuglia and martino.
|Dataset||#classes||#graphs||average #nodes||average #edges|
B.2 Setup for persistence images
For each dataset, the persistence image for each object inside is computed within the rectangular bounding box of the points from all persistence diagrams of input trees. The -direction is then discretized to uniform intervals, while the -direction is discretized accordingly so that each pixel is a square. For the persistence image (PI) approach of [1], we show results both for the unweighted persistence images (PI-CONST), and for one, denoted by PI-PL, where the weight function (for Definition 2.1) is the following piecewise-linear function (modified from one proposed by Adams et al. [1]), where is the largest persistence of any persistence point among all persistence diagrams.
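The discretization described above (a shared bounding box, uniform intervals in one direction, and square pixels) can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the number of intervals `n_x` is left as a free parameter, points are plain coordinate pairs in whatever convention the persistence images use, and rounding the number of rows up so that the grid covers the whole box is a choice made here.

```python
import math


def square_pixel_grid(diagrams, n_x):
    """Bounding box and square-pixel grid for a collection of diagrams.

    diagrams: list of diagrams, each a list of (x, y) points.
    n_x: number of uniform intervals along the x-direction (free parameter).
    Returns (x_min, y_min, side, n_x, n_y).
    """
    points = [p for dgm in diagrams for p in dgm]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    if x_max == x_min:
        raise ValueError("bounding box has zero width in the x-direction")
    side = (x_max - x_min) / n_x  # pixel width, reused as pixel height
    # Enough rows of square pixels to cover the y-extent (at least one row).
    n_y = max(1, math.ceil((y_max - y_min) / side))
    return x_min, y_min, side, n_x, n_y


def pixel_centers(x_min, y_min, side, n_x, n_y):
    """Centers of all pixels, listed row by row."""
    return [(x_min + (c + 0.5) * side, y_min + (r + 0.5) * side)
            for r in range(n_y) for c in range(n_x)]
```

Using one grid for the whole dataset keeps all persistence images the same length, so a weight per pixel has the same meaning across objects.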
|Datasets||Existing TDA approaches||Alternative metric learning||Our WKPI framework|
B.3 Benchmark datasets for graph classification
Below we first give a brief description of the benchmark datasets we used in our experiments. These are collected from the literature.
NCI1 and NCI109  consist of two balanced subsets of datasets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines, respectively.
PTC [21] is a dataset of graph structures of chemical molecules from rats and mice, designed for the predictive toxicology challenge 2000-2001.
DD [18] is a dataset of 1178 protein structures. Each protein is represented by a graph, in which the nodes are amino acids and two nodes are connected by an edge if they are less than 6 Angstroms apart. The graphs are classified according to whether they are enzymes or not.
PROTEINS [5] contains graphs of proteins. In each graph, a node represents a secondary structure element (SSE) within the protein structure, i.e., helices, sheets and turns. Edges connect nodes if they are neighbours along the amino acid sequence or neighbours in protein structure space. Every node is connected to its three nearest spatial neighbours.
MUTAG [16] is a dataset of 188 mutagenic aromatic and heteroaromatic nitro compounds, labelled according to whether they have a mutagenic effect on the Gram-negative bacterium Salmonella typhimurium.
REDDIT-5K and REDDIT-12K [43] consist of graphs representing discussions on the online forum Reddit. In these datasets, nodes represent users, and an edge between two nodes indicates that one of the two users left a comment for the other. In REDDIT-5K, graphs are collected from 5 sub-forums and are labelled by the sub-forum to which they belong. In REDDIT-12K, there are 11 sub-forums involved, and the labels are similar to those in REDDIT-5K.
IMDB-BINARY and IMDB-MULTI [43] are datasets consisting of networks of 1000 actors or actresses who played roles in movies in IMDB. In each graph, a node represents an actor or actress, and an edge connects two nodes when they appear in the same movie. In IMDB-BINARY, graphs are classified into Action and Romance genres. In IMDB-MULTI, they are collected from three different genres: Comedy, Romance and Sci-Fi.
The statistics of these datasets (number of classes, number of graphs, average number of nodes, and average number of edges) are provided in the dataset statistics table.
B.4 Topological-based methods on graph data
Here we compare our WKPI-framework with the performance of several state-of-the-art persistence-based classification frameworks, including PWGK [28], SW [10], and PI [1]. We also compare it with two alternative ways to learn the metric for persistence-based representations. trainPWGK is the version of PWGK [28] where we learn the weight function in its formulation, using the same cost-function as the one we propose in this paper for our WKPI kernel functions. altWKPI is the alternative formulation of a kernel for persistence images where we set the kernel to be , instead of our WKPI-kernel as defined in Definition 3.1. We use the same setup as for our WKPI-framework to train these two metrics, and use their resulting kernels for SVM to classify the benchmark graph datasets. The WKPI-framework outperforms the existing approaches and the alternative metric learning methods on all datasets except MUTAG. WKPI-kM (i.e., WKPI-kmeans) and WKPI-kC (i.e., WKPI-kcenter) improve the accuracy by and , respectively. The classification accuracies of all these methods are reported in the table comparing persistence-based methods on the benchmark datasets.
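The last step shared by all of these kernel-based comparisons, building a kernel matrix over the dataset and handing it to an SVM, can be sketched as follows. The sketch assumes that the WKPI kernel is a weighted sum of per-pixel Gaussian kernels on vectorized persistence images; the function names, the flat-list image representation, and the default `sigma` are illustrative assumptions, and the weights here are simply given rather than learned.

```python
import math


def wkpi_kernel(pi_a, pi_b, weights, sigma=1.0):
    """Weighted sum over pixels of a Gaussian kernel on pixel values (assumed form)."""
    return sum(w * math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))
               for a, b, w in zip(pi_a, pi_b, weights))


def gram_matrix(images, weights, sigma=1.0):
    """Kernel matrix K[i][j] = wkpi_kernel(images[i], images[j], weights)."""
    n = len(images)
    return [[wkpi_kernel(images[i], images[j], weights, sigma)
             for j in range(n)] for i in range(n)]
```

The resulting matrix can be passed to any SVM implementation that accepts a precomputed kernel; swapping in a different kernel function changes only `wkpi_kernel`, which is what makes comparisons between alternative formulations straightforward.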