Learning metrics for persistence-based summaries and applications for graph classification
Abstract
Recently, a new feature representation and data analysis methodology based on a topological tool called persistent homology (and its corresponding persistence diagram summary) has started to gain momentum.
A series of methods have been developed to map a persistence diagram to a vector representation so as to facilitate the downstream use of machine learning tools; in these approaches, the importance (weight) of different persistence features is often preset. In practice, however, the choice of the weight function should depend on the nature of the specific type of data one considers, and it is thus highly desirable to learn a best weight function (and hence a metric for persistence diagrams) from labelled data. We study this problem and develop a new weighted kernel, called WKPI, for persistence summaries, as well as an optimization framework to learn a good metric for persistence summaries. Both our kernel and the optimization problem have nice properties. We further apply the learned kernel to the challenging task of graph classification, and show that our WKPI-based classification framework obtains similar or (sometimes significantly) better results than the best results from a range of previous graph classification frameworks on a collection of benchmark datasets.
1 Introduction
In recent years, a new data analysis methodology based on a topological tool called persistent homology has started to attract attention in the learning community. Persistent homology is one of the most important developments in the field of topological data analysis in the past two decades, with fundamental developments both on the theoretical front (e.g., [20, 9, 11, 7, 12]) and on algorithms / efficient implementations (e.g., [38, 4, 13, 17, 26, 3]). At a high level, given a domain $X$ with a function $f$ defined on it, persistent homology provides a way to summarize the "features" of $X$ across multiple scales simultaneously in a single summary called the persistence diagram (see the lower left picture in Figure 1). A persistence diagram consists of a multiset of points in the plane, where each point intuitively corresponds to the birth time ($b$) and death time ($d$) of some (topological) feature of $X$ w.r.t. $f$. Hence it provides a concise representation of $X$, capturing its multiscale features. Furthermore, the persistent homology framework can be applied to complex data (e.g., 3D shapes, or graphs), and different summaries can be constructed by putting different descriptor functions on the input data.
For these reasons, a new persistence-based feature vectorization and data analysis framework (see Figure 1) has recently attracted much attention. Specifically, given a collection of objects, say a set of graphs modeling chemical compounds, one can first convert each object to a persistence-based representation. The input data can then be viewed as a set of points in a certain persistence-based feature space. Equipping this space with an appropriate distance or kernel, one can perform downstream data analysis tasks, such as clustering or classification.
The original distances for persistence diagram summaries unfortunately do not lend themselves easily to machine learning tasks. Hence, in the last few years, starting from the persistence landscape [6], a series of methods have been developed to map a persistence diagram to a vector representation that facilitates machine learning tools. Recent ones include the Persistence Scale-Space kernel [37], Persistence Images [1], the Persistence Weighted Gaussian kernel (PWGK) [28], the Sliced Wasserstein kernel [10], and the Persistence Fisher kernel [29].
In these approaches, when computing the distance or kernel between persistence summaries, the importance (weight) of different persistence features is often predetermined (e.g., uniform weights, or features weighted by their persistence). In persistence images [1] and PWGK [28], the importance of having a weight function over the birth-death plane (containing the persistence points) has been emphasized and explicitly included in the formulation of the kernels. However, before using these kernels, the weight function needs to be preset.
On the other hand, as recognized by [23], the choice of the weight function should depend on the nature of the specific type of data one considers. For example, in persistence diagrams computed from atomic configurations of molecules, features with small persistence can capture the local packing patterns, which are of utmost importance and should thus be given a larger weight; in many other scenarios, small persistence typically corresponds to noise of low importance. In general, however, researchers performing data analysis tasks may not have such prior insights into the input data. It is thus natural and highly desirable to learn a best weight function from labelled data.
Our work and contributions.
In this paper, we study the problem of learning an appropriate metric (kernel) for persistence-based summaries from labelled data, as well as applying the learned kernel to the challenging graph classification task. Our contributions are twofold.
 Metric learning for persistence summaries:

We propose a new weighted kernel (called WKPI) for persistence summaries, based on the persistence image representation. Our new WKPI kernel is positive semi-definite, and its induced distance has a stability property. The weight function used in this kernel directly encodes the importance of different locations in the persistence diagram. We then model the metric-learning problem for persistence summaries as the problem of learning (the parameters of) this weight function from a certain function class. In particular, we develop a cost function, and the metric learning is then formulated as an optimization problem. Interestingly, we show that this cost function has a simple matrix view, which both conceptually clarifies its meaning and simplifies the implementation of its optimization.
 Graph classification application:

Given a set of objects with class labels, we first learn a best WKPI kernel as described above, and then use the learned WKPI to classify objects. We implemented this WKPI-classification framework and applied it to a range of graph datasets. Graph classification is an important problem, and there is a large literature on developing effective graph representations (e.g., [22, 36, 2, 27, 39, 42, 34]) and graph neural networks (e.g., [43, 35, 41, 40, 30]) for classifying graphs. The problem is challenging as graph data are less structured and more complex. We ran our WKPI-classification framework on a range of benchmark graph datasets as well as new datasets (modeling neuron morphologies). We show that the learned WKPI is consistently much more effective than other persistence-based kernels. Most importantly, when compared with several existing state-of-the-art graph classification frameworks, our framework shows similar or (sometimes significantly) better performance in almost all cases than the best results by existing approaches.¹ Given the importance of the graph classification problem, we believe that this is an independent and important contribution of our work.

¹ Several datasets are attributed graphs, and our new framework achieves these results without even using those attributes.
We note that [23] was the first to recognize the importance of using labelled data to learn a task-optimal representation of topological signatures. They developed an end-to-end deep neural network for this purpose, using a novel and elegant design of the input layer to implicitly learn a task-specific representation. We instead explicitly formulate the metric-learning problem for persistence summaries, and decouple the metric-learning component (which can also be viewed as representation learning) from the downstream data analysis tasks. Also, as shown in Section 4, our WKPI-classification framework (using SVM) achieves better results on graph classification datasets.
2 Persistence-based framework
We first give an informal description of persistent homology below; see [19] for a more detailed exposition of the subject.
Suppose we are given a shape $X$ (in our later graph classification application, $X$ is a graph). Imagine we inspect $X$ through a filtration of $X$, which is a sequence of growing subsets of $X$: $X_1 \subseteq X_2 \subseteq \cdots \subseteq X_m = X$. As we scan $X$, sometimes a new feature appears in some $X_i$, and sometimes an existing feature disappears upon entering some $X_j$. Using the topological objects called homology classes to describe these features (which intuitively capture connected components, independent loops / voids, and their higher-dimensional counterparts), the birth and death of topological features can then be captured by persistent homology, in the form of a persistence diagram $P$. Specifically, $P$ consists of a multiset of points in the plane (which we call the birth-death plane), where each point $(b, d)$ in it, called a persistence point, indicates that a certain homological feature is created upon entering $X_b$ and destroyed upon entering $X_d$. A common way to obtain a meaningful filtration of $X$ is via the sublevel-set filtration induced by a descriptor function on $X$. More specifically, given a function $f: X \to \mathbb{R}$, let $X_\alpha = f^{-1}((-\infty, \alpha])$ be its sublevel set at $\alpha$. Let $\alpha_1 < \alpha_2 < \cdots < \alpha_m$ be real values. The sublevel-set filtration w.r.t. $f$ is $X_{\alpha_1} \subseteq X_{\alpha_2} \subseteq \cdots \subseteq X_{\alpha_m} = X$, and its persistence diagram is denoted by $\mathrm{Dg}(f)$. Each persistence point $(b, d) \in \mathrm{Dg}(f)$ records the function values at which some topological feature is created (when entering $X_b$) and destroyed (in $X_d$), and the persistence of this feature is its lifetime $d - b$. See Figure 2 (a) for a simple example. If one sweeps top-down in decreasing function values, one obtains the persistence diagram induced by the superlevel-set filtration of $X$ w.r.t. $f$ in an analogous way. Finally, if one tracks the change of topological features in the level set $f^{-1}(\alpha)$, one obtains the so-called levelset zigzag persistence [8].
Persistent homology provides a generic yet powerful way to summarize a space $X$. Even when the space is complex, say a graph, we can still map it to a persistence diagram via appropriate descriptor functions. Furthermore, different descriptor functions provide different perspectives of $X$, and each resulting persistence diagram summarizes features of $X$ at all scales w.r.t. that perspective.
If we are given a collection of shapes $\{X_1, \ldots, X_n\}$, we can compute a persistence diagram $P_j$ for each $X_j$, which maps the collection to a set of points in the space of persistence diagrams. There are natural distances defined for persistence diagrams, including the bottleneck distance and the Wasserstein distance, both of which have been well studied (e.g., stability under these distances [14, 15, 12]) with efficient implementations available [24, 25]. However, to facilitate downstream machine learning / data analysis tasks, it is desirable to further map the persistence diagrams to another representation (e.g., in a Hilbert space). Below we introduce one such representation, the persistence image [1], as our new kernel is based on it.
Persistence images.
Let $P$ be a persistence diagram (containing a multiset of persistence points). Set $T: \mathbb{R}^2 \to \mathbb{R}^2$ to be the linear transformation where for each $(b, d) \in \mathbb{R}^2$, $T(b, d) = (b, d - b)$. Let $T(P)$ be the transformed diagram of $P$. Let $\phi_u: \mathbb{R}^2 \to \mathbb{R}$ be a differentiable probability distribution with mean $u$ (e.g., the normalized Gaussian $\phi_u = N_u$, where for any $z \in \mathbb{R}^2$, $N_u(z) = \frac{1}{2\pi\sigma^2} e^{-\|z - u\|^2 / (2\sigma^2)}$).
Definition 2.1 ([1])
Let $\omega: \mathbb{R}^2 \to \mathbb{R}_{\ge 0}$ be a non-negative weight function for the persistence plane $\mathbb{R}^2$. Given a persistence diagram $P$, its persistence surface $\rho_P$ (w.r.t. $\omega$) is defined as: for any $z \in \mathbb{R}^2$,
$$\rho_P(z) = \sum_{u \in T(P)} \omega(u)\, \phi_u(z).$$
See Figure 2 (b) for an example. Adams et al. further "discretize" the 2D persistence surface to map it to a finite vector. In particular, fix a grid on a rectangular region in the plane with a collection of $N$ rectangles (pixels).
Definition 2.2 ([1])
Given a persistence diagram $P$, its persistence image $\mathrm{PI}_P$ consists of $N$ numbers, one for each pixel $p$ in the grid, with
$$\mathrm{PI}_P(p) = \iint_p \rho_P(z)\, dz.$$
The persistence image can be viewed as a vector in $\mathbb{R}^N$. One can then compute the distance between two persistence diagrams $P_1$ and $P_2$ as the distance between their persistence images (vectors) $\mathrm{PI}_1$ and $\mathrm{PI}_2$. Persistence images have several nice properties, including stability guarantees; see [1] for more details.
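To make Definitions 2.1 and 2.2 concrete, the following NumPy sketch maps a diagram to a persistence image. It approximates the per-pixel integral by evaluating the persistence surface at the pixel center; the function name, grid extent, and Gaussian width are illustrative choices, not the paper's implementation.

```python
import numpy as np

def persistence_image(diagram, grid_size=20, extent=(0.0, 1.0, 0.0, 1.0),
                      sigma=0.05, weight=None):
    """Map a persistence diagram (list of (birth, death) pairs) to a
    persistence image: transform each point to (birth, persistence)
    coordinates, place a weighted Gaussian at each transformed point,
    and record the surface value at every pixel center."""
    if weight is None:
        weight = lambda b, p: 1.0             # unweighted (constant) choice
    xmin, xmax, ymin, ymax = extent
    xs = np.linspace(xmin, xmax, grid_size)   # pixel-center x coords (birth)
    ys = np.linspace(ymin, ymax, grid_size)   # pixel-center y coords (persistence)
    X, Y = np.meshgrid(xs, ys)
    img = np.zeros_like(X)
    for b, d in diagram:
        pb, pp = b, d - b                     # T(b, d) = (b, d - b)
        g = np.exp(-((X - pb) ** 2 + (Y - pp) ** 2) / (2 * sigma ** 2))
        img += weight(pb, pp) * g / (2 * np.pi * sigma ** 2)
    return img.ravel()                        # vector in R^N, N = grid_size^2
```

For a diagram with a single point $(0.2, 0.6)$, the resulting image peaks at the pixel whose center is nearest to $T(0.2, 0.6) = (0.2, 0.4)$.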
3 Metric learning frameworks
Suppose we are given a set of objects $X = \{x_1, \ldots, x_n\}$ (sampled from a hidden data space $\mathcal{X}$), classified into $k$ classes. We want to use these labelled data to learn a good distance for (persistence-image representations of) objects from $X$, which hopefully is more appropriate for classifying objects in the data space $\mathcal{X}$. To do so, we propose below a new persistence-based kernel for persistence images, and then formulate an optimization problem to learn the best weight function, so as to obtain a good distance metric for $X$ (and the data space $\mathcal{X}$).
3.1 Weighted persistence image kernel (WKPI)
From now on, we fix the grid (of size $N$) used to generate persistence images (so a persistence image is a vector in $\mathbb{R}^N$). Let $p_s$ denote the center of the $s$-th pixel in the grid, for $s \in \{1, \ldots, N\}$. We now introduce a new kernel for persistence images. A weight function refers to a non-negative real-valued function $\omega: \mathbb{R}^2 \to \mathbb{R}_{\ge 0}$ defined on the birth-death plane.
Definition 3.1
Let $\omega$ be a weight function. Given two persistence images $\mathrm{PI}_1$ and $\mathrm{PI}_2$, the ($\omega$-)weighted persistence image kernel (WKPI) is defined as:
$$\mathrm{k}_\omega(\mathrm{PI}_1, \mathrm{PI}_2) := \sum_{s=1}^{N} \omega(p_s)\, e^{-(\mathrm{PI}_1(p_s) - \mathrm{PI}_2(p_s))^2 / (2\sigma^2)}, \qquad (1)$$
where $\sigma > 0$ is a width parameter of the kernel.
Remark 1: We could use persistence surfaces (instead of persistence images) to define the kernel (with the summation replaced by an integral). Since for computational purposes one still needs to approximate the integral via some discretization, we choose to present our work using persistence images directly. Our Theorems 3.2 and 3.4 still hold (with slightly different stability bounds) if the kernel is defined over persistence surfaces.
Remark 2: One can choose the weight function from different function classes. Two popular choices are: mixtures of 2D Gaussians, and polynomials of bounded degree in two variables.
Remark 3: There are other natural ways to define a weighted kernel for persistence images: for example, an alternative formulation that we refer to as alt-WKPI, or using the weight function of the persistence-weighted Gaussian kernel (PWGK) [28] directly. We have implemented all these choices, and our experiments show that our WKPI kernel leads to better results than these alternatives on all datasets (see Appendix B.4). Furthermore, we note that the square of our WKPI-distance depends on $\omega$ linearly, which is much simpler when computing gradients for our cost function later via the chain rule. In addition, the PWGK kernel [28] contains cross terms in its formulation, meaning that a quadratic number of terms (w.r.t. the number of persistence points) is needed to evaluate the kernel. This makes it more expensive to compute and learn for complex objects (e.g., for the neuron datasets, a single neuron tree can produce a persistence diagram with hundreds of persistence points).
Theorem 3.2
The WKPI kernel is positive semidefinite.
By the above result, the WKPI kernel gives rise to a Hilbert space. We can now introduce the following WKPI-distance, induced by the inner product on this Hilbert space.
Definition 3.3
Given two persistence diagrams $P_1$ and $P_2$, let $\mathrm{PI}_1$ and $\mathrm{PI}_2$ be their corresponding persistence images. Given a weight function $\omega$, the ($\omega$-weighted) WKPI-distance is defined as:
$$\mathrm{D}_\omega(\mathrm{PI}_1, \mathrm{PI}_2) := \left(\mathrm{k}_\omega(\mathrm{PI}_1, \mathrm{PI}_1) + \mathrm{k}_\omega(\mathrm{PI}_2, \mathrm{PI}_2) - 2\,\mathrm{k}_\omega(\mathrm{PI}_1, \mathrm{PI}_2)\right)^{1/2}.$$
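The kernel of Eq. (1) and its induced distance amount to a few lines of NumPy; the sketch below assumes the persistence images and the weight function have already been evaluated at the $N$ pixel centers (names and the default width are illustrative, not the paper's code).

```python
import numpy as np

def wkpi_kernel(pi1, pi2, w, sigma=1.0):
    """WKPI kernel: a weighted sum, over the N pixel centers, of Gaussians
    in the difference of the two persistence-image values.
    `w` is the vector (w(p_1), ..., w(p_N)) of weights at pixel centers."""
    return np.sum(w * np.exp(-(pi1 - pi2) ** 2 / (2 * sigma ** 2)))

def wkpi_distance(pi1, pi2, w, sigma=1.0):
    """Distance induced by the WKPI kernel (Definition 3.3)."""
    sq = (wkpi_kernel(pi1, pi1, w, sigma) + wkpi_kernel(pi2, pi2, w, sigma)
          - 2 * wkpi_kernel(pi1, pi2, w, sigma))
    return np.sqrt(max(sq, 0.0))   # guard against tiny negative round-off
```

Expanding the squared distance gives $\mathrm{D}_\omega^2 = \sum_s \omega(p_s)\,2\,(1 - e^{-(\mathrm{PI}_1(p_s)-\mathrm{PI}_2(p_s))^2/(2\sigma^2)})$, which makes the linearity in $\omega$ noted in Remark 3 explicit.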
Stability of the WKPI-distance.
Given two persistence diagrams $P_1$ and $P_2$, two traditional distances between them are the bottleneck distance $d_B$ and the $p$-th Wasserstein distance $d_{W,p}$. The stability of these two distances w.r.t. changes of the input objects, or of the functions defined on them, has been studied [14, 15, 12]. Similar to the stability study for persistence images, we prove below that the WKPI-distance is stable w.r.t. small perturbations of persistence diagrams as measured by $d_{W,1}$. (Intuitively, viewing the two persistence diagrams $P_1$ and $P_2$ as two (appropriate) measures, $d_{W,1}$ is then the "earth-mover" distance needed to convert the measure corresponding to $P_1$ to that of $P_2$.)
To simplify the presentation of Theorem 3.4, we use unweighted persistence images w.r.t. the Gaussian; that is, in Definition 2.1, (1) the weight function $\omega$ is the constant function $\omega \equiv 1$; and (2) the distribution $\phi_u$ is the Gaussian $N_u$ with width $\sigma$. The proof of the following theorem can be found in Appendix A.2.
Theorem 3.4
Let $P_1$ and $P_2$ be two persistence diagrams, and let $\mathrm{PI}_1$ and $\mathrm{PI}_2$ be their unweighted persistence images as above. Then $\mathrm{D}_\omega(\mathrm{PI}_1, \mathrm{PI}_2) \le C \cdot d_{W,1}(P_1, P_2)$, where the constant $C$ depends only on the Gaussian width $\sigma$ and the fixed pixel grid.
Remarks: A more general bound can be obtained when the distribution $\phi_u$ is not Gaussian. Furthermore, a similar bound holds when our WKPI kernel and its induced WKPI-distance are defined using persistence surfaces instead of persistence images. We omit these results from this short version of the paper.
3.2 Optimization problem for metric learning
Suppose we are given a collection of objects $X = \{x_1, \ldots, x_n\}$ (sampled from some hidden data space $\mathcal{X}$), already classified (labelled) into $k$ classes $C_1, \ldots, C_k$. In what follows, we say that $x \in C_i$ if $x$ has class label $i$. We first compute the persistence diagram $P_j$ for each object $x_j$. (The precise filtration used to do so will depend on the specific type of objects; later, in Section 4, we describe the filtrations used for graph data.) Let $\{P_1, \ldots, P_n\}$ be the resulting set of persistence diagrams, with persistence images $\mathrm{PI}_1, \ldots, \mathrm{PI}_n$. Given a weight function $\omega$, its induced WKPI-distance between $\mathrm{PI}_j$ and $\mathrm{PI}_l$ can also be thought of as a distance for the original objects $x_j$ and $x_l$; that is, we can set $\mathrm{D}_\omega(x_j, x_l) := \mathrm{D}_\omega(\mathrm{PI}_j, \mathrm{PI}_l)$. Our goal is to learn a good distance metric for the data space $\mathcal{X}$ (from which the $x_j$'s are sampled) from the labels. We formulate this as learning a best weight function so that its induced WKPI-distance fits the class labels of the $x_j$'s best. Specifically, for any $i \in \{1, \ldots, k\}$, set:
$$\mathrm{cost}(i) := \sum_{x,\, y \in C_i} \mathrm{D}_\omega^2(x, y), \qquad \mathrm{costT}(i) := \sum_{x \in C_i,\; y \in X} \mathrm{D}_\omega^2(x, y).$$
Intuitively, $\mathrm{cost}(i)$ is the total in-class (squared) distance for class $C_i$, while $\mathrm{costT}(i)$ is the total (squared) distance from objects in class $C_i$ to all objects in $X$. A good metric should yield relatively small distances between objects from the same class, and larger distances between objects from different classes. We thus propose the following optimization problem:
Definition 3.5 (Optimization problem)
Given a weight function $\omega$, the total cost of its induced WKPI-distance over $X$ is defined as:
$$\mathrm{TC}(\omega) := \sum_{i=1}^{k} \frac{\mathrm{cost}(i)}{\mathrm{costT}(i)}.$$
The optimal distance problem aims to find the best weight function from a certain function class $\mathcal{F}$ so that the total cost is minimized; that is:
$$\omega^* = \operatorname*{argmin}_{\omega \in \mathcal{F}} \mathrm{TC}(\omega).$$
Matrix view of the optimization problem.
We observe that our cost function can be reformulated in matrix form. This provides a graph-Laplacian perspective for understanding the cost function, and it simplifies the implementation of our optimization, as several programming languages popular in machine learning (e.g., Python and Matlab) handle matrix operations more efficiently than loops.
More precisely, recall that our input is a set of $n$ objects $X$ with labels from $k$ classes. We set up the following matrices: the matrix $\mathrm{D} \in \mathbb{R}^{n \times n}$ of pairwise squared WKPI-distances, with $\mathrm{D}[a][b] = \mathrm{D}_\omega^2(x_a, x_b)$; the diagonal matrix $\mathrm{G} \in \mathbb{R}^{n \times n}$ with $\mathrm{G}[a][a] = \sum_b \mathrm{D}[a][b]$; and the normalized class-indicator matrix $\mathrm{H} \in \mathbb{R}^{n \times k}$ whose $i$-th column is $h_i / \sqrt{h_i^T \mathrm{G}\, h_i}$, where $h_i \in \{0, 1\}^n$ indicates membership in class $C_i$.
If we view $\mathrm{D}$ as a distance matrix of the objects in $X$, then $\mathrm{L} = \mathrm{G} - \mathrm{D}$ is its Laplacian matrix. The technical proof of the following main theorem can be found in Appendix A.3.
Theorem 3.6
The total cost can also be represented as $\mathrm{TC}(\omega) = k - \mathrm{tr}(\mathrm{H}^T \mathrm{L}\, \mathrm{H})$, where $\mathrm{tr}(\cdot)$ is the trace of a matrix. Furthermore, $\mathrm{H}^T \mathrm{G}\, \mathrm{H} = \mathrm{I}$, where $\mathrm{I}$ is the $k \times k$ identity matrix.
Note that the matrices $\mathrm{D}$, $\mathrm{G}$, $\mathrm{L}$ and $\mathrm{H}$ all depend on the (parameters of the) weight function $\omega$; in the following corollary of Theorem 3.6, we use the subscript $\omega$ to emphasize this dependence.
Corollary 3.7
The optimal distance problem is equivalent to
$$\omega^* = \operatorname*{argmax}_{\omega \in \mathcal{F}} \ \mathrm{tr}(\mathrm{H}_\omega^T \mathrm{L}_\omega \mathrm{H}_\omega) \quad \text{subject to } \mathrm{H}_\omega^T \mathrm{G}_\omega \mathrm{H}_\omega = \mathrm{I}.$$
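The equivalence between the direct and matrix forms of the total cost is easy to verify numerically. The snippet below is a small self-contained check, with an arbitrary symmetric matrix standing in for the squared WKPI-distances and one concrete choice of the normalized class-indicator matrix; it illustrates the reformulation and is not the paper's code.

```python
import numpy as np

n, k = 12, 3
labels = np.array([0, 1, 2] * 4)            # 4 objects in each of k = 3 classes
rng = np.random.default_rng(0)
A = rng.random((n, n))
Dsq = A + A.T                               # symmetric stand-in for squared distances
np.fill_diagonal(Dsq, 0.0)

# Direct form: TC = sum_i cost(i) / costT(i).
tc_direct = 0.0
for i in range(k):
    in_i = labels == i
    cost_i = Dsq[np.ix_(in_i, in_i)].sum()  # in-class squared distances
    costT_i = Dsq[in_i, :].sum()            # class-to-all squared distances
    tc_direct += cost_i / costT_i

# Matrix form: TC = k - tr(H^T L H), with G the diagonal row-sum matrix,
# L = G - Dsq the Laplacian of Dsq, and H the G-normalized class indicator.
G = np.diag(Dsq.sum(axis=1))
L = G - Dsq
E = np.eye(k)[labels]                       # E[a, i] = 1 iff object a is in class i
H = E / np.sqrt(np.diag(E.T @ G @ E))       # columns scaled so that H^T G H = I
tc_matrix = k - np.trace(H.T @ L @ H)

assert np.allclose(tc_direct, tc_matrix)
assert np.allclose(H.T @ G @ H, np.eye(k))
```

The two assertions pass for any symmetric non-negative matrix with a zero diagonal, since each ratio $\mathrm{cost}(i)/\mathrm{costT}(i)$ equals $1 - h_i^T \mathrm{L}\, h_i / (h_i^T \mathrm{G}\, h_i)$.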
Solving the optimization problem.
In our implementation, we use (stochastic) gradient descent to find a (locally) optimal weight function for the minimization problem. We use the matrix view given in Corollary 3.7, minimizing $k - \mathrm{tr}(\mathrm{H}_\omega^T \mathrm{L}_\omega \mathrm{H}_\omega)$ subject to $\mathrm{H}_\omega^T \mathrm{G}_\omega \mathrm{H}_\omega = \mathrm{I}$. We briefly describe our procedure, assuming that the weight function comes from the class of mixtures of $m$ 2D non-negatively weighted (spherical) Gaussians, i.e., $\omega(p) = \sum_{j=1}^{m} \alpha_j\, e^{-\|p - z_j\|^2 / (2\hat{\sigma}^2)}$. Each weight function is thus determined by the parameters $\{(\alpha_j, z_j)\}_{j=1}^{m}$ with $\alpha_j \ge 0$.
From the proof of Theorem 3.6 (in the appendix), it turns out that this constraint is satisfied as long as the multiplicative weight of each Gaussian in the mixture is non-negative. Hence during gradient descent we only need to make sure that this holds.² It is easy to write out the gradient of the cost w.r.t. each parameter in matrix form; for example, since $\mathrm{D}_\omega^2$ is linear in the multiplicative weights $\alpha_j$, the entries of $\partial \mathrm{D} / \partial \alpha_j$ (and hence of $\partial \mathrm{G} / \partial \alpha_j$ and $\partial \mathrm{L} / \partial \alpha_j$) have simple closed forms.

² In our implementation, we add a penalty term to the total cost $\mathrm{TC}(\omega)$ to achieve this in a "soft" manner.
While this does not improve the asymptotic complexity of computing the gradient (compared to using the formulation of the cost function in Definition 3.5), programming languages such as Python and Matlab implement these matrix operations much more efficiently than explicit loops. For large datasets, one can use stochastic gradient descent, sampling a subset of the input persistence images and computing the matrices as well as the cost on the subsampled data points. In our implementation, we use the Armijo-Goldstein line search scheme to update the parameters in each (stochastic) gradient descent step. The optimization procedure terminates when the cost function converges or the number of iterations exceeds a threshold.
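To illustrate the optimization loop, here is a toy sketch: the squared-distance matrix is written as a non-negative combination $\mathrm{Dsq} = \sum_j \alpha_j B_j$ (using the linearity of the squared WKPI-distance in the mixture weights), and projected gradient descent with a simple forward-difference gradient shrinks the weights of uninformative components. A fixed step size stands in for the Armijo-Goldstein line search; all names are illustrative.

```python
import numpy as np

def total_cost(alpha, B, labels, k):
    """TC for mixture weights alpha, where the squared-distance matrix is
    Dsq = sum_j alpha_j * B[j] (squared WKPI-distance is linear in the
    multiplicative weights of the Gaussian mixture)."""
    Dsq = np.tensordot(alpha, B, axes=1)
    tc = 0.0
    for i in range(k):
        in_i = labels == i
        tc += Dsq[np.ix_(in_i, in_i)].sum() / Dsq[in_i, :].sum()
    return tc

def learn_weights(B, labels, k, steps=200, lr=0.1, eps=1e-6):
    """Projected gradient descent on TC with a numerical gradient,
    keeping every mixture weight non-negative."""
    alpha = np.ones(len(B))
    for _ in range(steps):
        base = total_cost(alpha, B, labels, k)
        grad = np.zeros(len(B))
        for j in range(len(B)):              # forward-difference gradient
            a = alpha.copy()
            a[j] += eps
            grad[j] = (total_cost(a, B, labels, k) - base) / eps
        alpha = np.maximum(alpha - lr * grad, 0.0)   # projection: weights >= 0
    return alpha
```

On a toy input where one component contributes only cross-class distances and another only in-class distances, the optimizer drives the in-class component's weight to zero, as one would hope.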
4 Experiments
In this section, we demonstrate the effectiveness of our metric-learning framework and the usefulness of the learned metric via graph classification applications. In particular, given a set of graphs from $k$ classes, we first compute the unweighted persistence image for each graph, and apply the framework from Section 3 to learn the "best" weight function on the birth-death plane from these persistence images. Next, we perform graph classification using kernel-SVM with the learned WKPI kernel. We refer to this pipeline as the WKPI-classification framework.
We present two families of experiments below. In Section 4.1, we show that our learned WKPI kernel significantly outperforms existing persistence-based representations. In Section 4.2, we compare the performance of the WKPI-classification framework with various state-of-the-art graph classification methods over a range of datasets.
Setup for our WKPI-based framework.
In all our experiments, we assume that the weight function comes from the class of mixtures of 2D non-negatively weighted Gaussians described at the end of Section 3.2. Furthermore, all Gaussians are isotropic with the same standard deviation (width). We treat the number of Gaussians and this width as hyperparameters: we search over a small set of candidate values and determine the final choices via 10 times 10-fold cross validation. That is, we split each dataset into 10 folds and perform 10-fold cross validation, repeating the whole process 10 times; in each 10-fold cross validation, 9 folds are used for training and 1 for testing.
One important question is how to initialize the centers of the Gaussians in our mixture. We consider three strategies: (1) sample the centers randomly in the domain of the persistence images; (2) collect all points of the persistence diagrams derived from the training data and run a k-means algorithm on them to identify the means; (3) run a k-center algorithm on those points to identify the centers. Strategies (2) and (3) usually outperform strategy (1); thus in what follows we only report results using the k-means and k-center initializations, referred to as WKPI-kM and WKPI-kC, respectively.
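The greedy farthest-point heuristic is a standard way to implement the k-center initialization of strategy (3). The sketch below is illustrative (not the paper's code): it deterministically starts from the first persistence point and then repeatedly adds the point farthest from the centers chosen so far.

```python
import numpy as np

def kcenter_greedy(points, m):
    """Greedy 2-approximation for the k-center problem, used to spread m
    initial Gaussian centers over the persistence points of the training
    diagrams: repeatedly pick the point farthest from the chosen centers."""
    pts = np.asarray(points, dtype=float)
    centers = [0]                                  # start from the first point
    dist = np.linalg.norm(pts - pts[0], axis=1)    # distance to nearest center
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))                 # farthest point so far
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(pts - pts[nxt], axis=1))
    return pts[centers]
```

Compared with k-means, this tends to cover outlying regions of the birth-death plane, which can matter when rare high-persistence points are discriminative.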
Table 1: Classification accuracy (%) on the neuron datasets. PWGK, SW and PI-PL are existing persistence-based approaches; alt-WKPI and train-PWGK are the alternative metric-learning baselines of Remark 3; WKPI-kM and WKPI-kC are our framework.

Dataset        PWGK   SW     PI-PL  alt-WKPI  train-PWGK  WKPI-kM  WKPI-kC
NEURON-BINARY  80.1   85.1   84.1   82.7      84.9        90.3     86.5
NEURON-MULTI   45.5   57.3   44.3   55.4      49.2        56.2     69.1
Average        62.80  71.20  64.20  69.05     67.05       73.50    77.80
4.1 Comparison with other persistence-based methods
We compare our methods with state-of-the-art persistence-based representations, including the Persistence Weighted Gaussian Kernel (PWGK) [28], the original Persistence Image (PI) [1], and the Sliced Wasserstein (SW) kernel [10]. Furthermore, as mentioned in Remark 3 after Definition 3.1, we can learn a weight function in PWGK by optimizing the same cost function (replacing our WKPI-distance with the distance computed from the PWGK kernel); we refer to this as train-PWGK. We can also use the alternative kernel for persistence images described in Remark 3, and optimize the same cost function using the distance computed from that kernel; we refer to this as alt-WKPI. We compare our methods both with the existing approaches and with these two alternative metric-learning approaches (train-PWGK and alt-WKPI).
Neuron datasets.
Neuron cells have a natural tree morphology, rooted at the cell body (soma), with dendrites and axons branching out; it is common in the field of neuroscience to model a neuron as a (geometric) tree. See Figure 4 in the appendix for an example. Our NEURON-BINARY dataset consists of 1126 neuron trees classified into two (primary) classes: interneurons and principal neurons (data partly from the Blue Brain Project [33] and downloaded from http://neuromorpho.org/). The second dataset, NEURON-MULTI, refines the 459 neurons of the interneuron class into four (secondary) classes; hence all neurons in NEURON-MULTI come from one class of NEURON-BINARY.
Generation of persistence.
Given a neuron tree $T$, following [31], we use the descriptor function $f: T \to \mathbb{R}$ where $f(x)$ is the geodesic distance from $x$ to the root of $T$ along the tree. For NEURON-MULTI, to differentiate the dendrite and axon parts of a neuron cell, we further negate the function value of a point if it lies in the dendrite. We then use the union of the persistence diagrams induced by both the sublevel-set and the superlevel-set filtrations w.r.t. $f$. Under these filtrations, intuitively, each point in the birth-death plane corresponds to the creation and death of a certain branch feature of the input neuron tree. The set of persistence diagrams obtained this way (one per neuron tree) is the input to our WKPI-classification framework.
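For concreteness, the descriptor $f$ can be computed with a standard shortest-path traversal from the root. The sketch below is a hypothetical helper (on a tree, a simple DFS/BFS with accumulated edge lengths would also suffice); it takes weighted adjacency lists, treats edge lengths as geodesic distances, and flips the sign on dendrite nodes as done for NEURON-MULTI.

```python
import heapq

def neuron_descriptor(adj, root, dendrite=frozenset()):
    """Descriptor f(x): geodesic (weighted shortest-path) distance from x
    to the root along the tree, with the sign flipped on dendrite nodes
    to separate dendrite from axon. `adj` maps node -> [(neighbor, length)]."""
    dist = {root: 0.0}
    pq = [(0.0, root)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                       # stale queue entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return {x: (-d if x in dendrite else d) for x, d in dist.items()}
```

Feeding the resulting vertex values to the sublevel- and superlevel-set filtrations then yields one persistence diagram per neuron tree.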
Results on neuron datasets.
The classification accuracy of the various methods is given in Table 1. To obtain these results, we split training and testing cases with a 1:1 ratio for both datasets. As the number of trees is not large, we use all training data to compute the gradients of the cost function in the optimization process, instead of mini-batch sampling. Our optimization procedure terminates when the change of the cost function stays below a fixed threshold or the number of iterations exceeds 2000. Persistence images are needed both for the methodology of [1] and as input to our WKPI-distance; their resolution is fixed for all experiments (see Appendix B.2 for details). For the persistence image (PI) approach of [1], we experimented both with unweighted persistence images (PI-CONST) and with a variant, denoted PI-PL, whose weight function is a simple piecewise-linear (PL) function adapted from the one proposed in [1]; see Appendix B.2 for details. Since PI-PL performs better than PI-CONST on both datasets, Table 1 only shows the results of PI-PL.
Note that while our WKPI framework is based on persistence images (PI), our classification accuracy is much higher. We also point out that, in the results here as well as in the graph classification task later, our method consistently outperforms all other persistence-based representations, often by a large margin; see Appendix B.4 for a comparison of our methods with these existing persistence-based frameworks on graph classification.
In Figure 3, we show heatmaps of the learned weight functions for both datasets. Interestingly, the important branching features (points in the birth-death plane with high weight) that separate the two primary classes (the NEURON-BINARY dataset) differ from those important for classifying neurons of one primary class (the interneurons) into the four secondary classes (the NEURON-MULTI dataset). Also, points of high importance (weight) need not have high persistence. In the future, it would be interesting to investigate whether the important branch features are also biochemically meaningful.
Figure 3: Heatmaps of the learned weight functions for NEURON-BINARY (left) and NEURON-MULTI (right).
4.2 Graph classification task
Table 2: Graph classification accuracy (%). RetGK through SW are previous approaches; WKPI-kM and WKPI-kC are ours. A dash indicates that no result is reported.

Dataset      RetGK  WL     WL-OA  GK     DGK   FGSD  PSCN  GIN   SW     WKPI-kM  WKPI-kC
NCI1         84.5   85.4   86.1   62.3   80.3  79.8  76.3  82.7  80.1   87.2     84.7
NCI109       –      84.5   86.3   66.6   80.3  78.8  –     –     75.5   85.6     87.3
PTC          62.5   55.4   63.6   57.3   60.1  62.8  62.3  66.6  64.5   63.1     67.1
PROTEIN      75.8   71.2   76.4   71.7   75.7  72.4  75.0  76.2  76.4   78.8     74.9
DD           81.6   78.6   79.2   78.5   –     77.1  76.2  –     78.9   82.0     80.3
MUTAG        90.3   84.4   84.5   81.6   87.4  92.1  89.0  90.0  87.1   86.9     87.5
IMDB-BINARY  71.9   70.8   –      65.9   67.0  71.0  71.0  75.1  69.6   70.7     75.4
IMDB-MULTI   47.7   49.8   –      43.9   44.6  45.2  45.2  52.3  48.7   46.4     49.5
REDDIT-5K    56.1   51.2   –      41.0   41.3  47.8  49.1  57.5  53.8   58.5     60.2
REDDIT-12K   48.7   32.6   –      31.8   32.2  –     41.3  –     48.3   47.7     48.6
Average      –      66.39  –      60.06  –     –     –     –     68.29  70.62    71.65
Table 3: Graph classification accuracy (%, mean ± standard deviation). A dash indicates that no result is reported.

Dataset      RetGK     WL        GK        DGK       PSCN      GIN       WKPI-kM   WKPI-kC
NCI1         84.5±0.2  85.4±0.3  62.3±0.3  80.3±0.5  76.3±1.7  82.7±1.6  87.2±0.4  84.7±0.4
NCI109       –         84.5±0.2  66.6±0.2  80.3±0.3  –         –         85.6±0.3  87.3±0.3
PTC          62.5±1.6  55.4±1.5  57.3±1.1  60.1±2.5  62.3±5.7  66.6±6.9  63.1±2.4  67.1±2.2
PROTEIN      75.8±0.6  71.2±0.8  71.7±0.6  75.7±0.5  75.0±2.5  76.2±2.6  78.8±0.4  74.9±0.3
DD           81.6±0.3  78.6±0.4  78.5±0.3  –         76.2±2.6  –         82.0±0.5  80.3±0.4
MUTAG        90.3±1.1  84.4±1.5  81.6±2.1  87.4±2.7  89.0±4.4  90.0±8.8  86.9±2.5  87.5±2.6
IMDB-BINARY  71.9±1.0  70.8±0.5  65.9±1.0  67.0±0.6  71.0±2.3  75.1±5.1  70.7±1.1  75.4±1.1
IMDB-MULTI   47.7±0.3  49.8±0.5  43.9±0.4  44.6±0.4  45.2±2.8  52.3±2.8  46.4±0.5  49.5±0.4
REDDIT-5K    56.1±0.5  51.2±0.3  41.0±0.2  41.3±0.2  49.1±0.7  57.5±1.5  58.5±0.4  60.2±0.6
REDDIT-12K   48.7±0.2  32.6±0.3  31.8±0.1  32.2±0.1  41.3±0.4  –         47.7±0.5  48.6±0.5
Benchmark datasets and comparison methods.
We use a range of benchmark datasets (collected from recent literature), including: (1) several datasets of graphs derived from small chemical compounds or protein molecules: NCI1 and NCI109 [39], PTC [21], PROTEIN [5], DD [18] and MUTAG [16]; (2) two datasets of graphs representing response relations between users on Reddit: REDDIT-5K (5 classes) and REDDIT-12K (11 classes) [43]; and (3) two datasets of IMDB networks of actors/actresses: IMDB-BINARY (2 classes) and IMDB-MULTI (3 classes). See Appendix B.3 for descriptions of these datasets and their statistics (graph sizes etc.).
Many graph classification methods have been proposed in the literature, with different methods performing better on different datasets. We therefore compare against a large number of approaches: six graph-kernel based approaches: RetGK [44], FGSD [40], the Weisfeiler-Lehman kernel (WL) [39], the Weisfeiler-Lehman optimal assignment kernel (WL-OA) [27], the Graphlet kernel (GK) [34], and the Deep Graphlet kernel (DGK) [43]; as well as three neural-network based approaches: PATCHY-SAN (PSCN) [35], the Graph Isomorphism Network (GIN) [41], and the deep learning framework with topological signatures (DL-TDA) [23].
Persistence generation.
To generate persistence diagram summaries, we need a meaningful descriptor function on the input graphs. We consider two choices in our experiments: (a) the Ricci-curvature function, using a discrete Ricci curvature for graphs as introduced in [32]; and (b) the Jaccard-index function. In particular, the Jaccard index of an edge $(u, v)$ in graph $G$ is defined as $J(u, v) = \frac{|\mathrm{NB}(u) \cap \mathrm{NB}(v)|}{|\mathrm{NB}(u) \cup \mathrm{NB}(v)|}$, where $\mathrm{NB}(u)$ refers to the set of neighbors of node $u$ in $G$. The Jaccard index has been commonly used as a way to measure edge similarity.³ As in the case of the neuron datasets, we take the union of the 0-th persistence diagrams induced by both the sublevel-set and the superlevel-set filtrations of the descriptor function, and convert it to a persistence image as input to our WKPI-classification framework.⁴ In the results reported in Table 2, the Ricci-curvature function is used for the small chemical compound datasets (NCI1, NCI109, PTC and MUTAG), while the Jaccard function is used for the two protein datasets (PROTEIN and DD) as well as the social/IMDB networks (IMDB's and REDDIT's).

³ We modify our persistence algorithm slightly to handle the edge-valued Jaccard-index function.
⁴ We expect that using the 0-th zigzag persistence diagrams would provide better results. However, we use only the 0-th standard persistence, as it can be implemented to run efficiently using a simple union-find data structure.
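The 0-th sublevel-set persistence used above can be computed with a simple union-find sweep (the "elder rule"). The sketch below handles a vertex-valued function, letting each edge enter the filtration at the larger of its endpoint values; the edge-valued Jaccard case needs the slight modification mentioned in the footnote. Names and conventions are illustrative, not the paper's implementation.

```python
def jaccard(adj, u, v):
    """Jaccard index of edge (u, v): |NB(u) & NB(v)| / |NB(u) | NB(v)|."""
    Nu, Nv = set(adj[u]), set(adj[v])
    return len(Nu & Nv) / len(Nu | Nv)

def zeroth_persistence(f, edges):
    """0-th sublevel-set persistence of a vertex function f on a graph,
    via union-find and the elder rule: when two components merge, the one
    with the younger (larger) birth value dies at the merge time."""
    parent = {v: v for v in f}
    birth = dict(f)                           # birth value of each component root

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]     # path halving
            v = parent[v]
        return v

    diagram = []
    for u, v in sorted(edges, key=lambda e: max(f[e[0]], f[e[1]])):
        t = max(f[u], f[v])                   # time this edge appears
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                          # edge closes a loop; no 0-dim change
        if birth[ru] > birth[rv]:
            ru, rv = rv, ru                   # make ru the elder component
        diagram.append((birth[rv], t))        # the younger component dies at t
        parent[rv] = ru
    roots = {find(v) for v in f}              # surviving components never die
    diagram.extend((birth[r], float("inf")) for r in roots)
    return diagram
```

Running the same sweep on $-f$ gives the superlevel-set diagram, and the union of the two diagrams is what is fed to the persistence-image step.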
Classification results.
The graph classification results of the various methods are reported in Tables 2 and 3. In particular, Table 2 compares the classification accuracy across a range of methods, while Table 3 also lists the variance of the prediction accuracy (via 10 times 10-fold cross validation). Note that not all the previous accuracies / variances in the literature are computed under the same 10 times 10-fold cross-validation setup as ours. For instance, the results reported for RetGK are computed from only 10-fold cross validation. The setup for our method is the same as for the neuron data; the only difference is that if the input dataset has more than 1000 graphs, then we choose mini-batches to compute the gradient in each iteration. Results of other methods are taken from their respective papers. The results of DL-TDA [23] are not listed in the table, as only the classification accuracies for REDDIT-5K and REDDIT-12K are given in their paper (which contains more results on images as well). The comparison with other persistence-based methods can be found in Appendix B.4 (where we consistently perform the best), and we only include one of them, SW [10], in this table, as it performs the best among existing persistence-based approaches.
The last two columns in Tables 2 and 3 are our results, where WKPI-kM stands for WKPI-kmeans and WKPI-kC for WKPI-kcenter. Except for MUTAG and IMDB-MULTI, the performance of our WKPI framework is the same as or better than the best of the other methods. It is important to observe that our WKPI framework performs well on both chemical graphs and social graphs, while some of the earlier work tends to work well on only one of the two types. Furthermore, note that chemical / molecular graphs usually have attributes associated with them. Some existing methods use these attributes in their classification [43, 35, 44]. Our results, however, are obtained purely from the graph structure without using any attributes. Finally, regarding variance (see Table 3), the variances of our methods tend to be on par with the previous graph-kernel based approaches, and these variances are usually much better than those of the GNN-based approaches (i.e., PSCN and GIN).
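The 10-times 10-fold evaluation protocol used above can be sketched as follows. This is a simplified illustration: `classify` is a hypothetical placeholder for training an SVM with the learned WKPI kernel on the training indices and predicting the held-out fold; it is not the authors' code.

```python
import random
from statistics import mean, pvariance

def stratified_folds(labels, k, rng):
    """Split indices 0..n-1 into k folds, keeping class proportions roughly equal."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

def repeated_cv(labels, classify, repeats=10, k=10, seed=0):
    """Run `repeats` rounds of stratified k-fold CV; return (mean, variance)
    of the per-fold accuracies.  `classify(train_idx, test_idx)` must return
    one predicted label per test index."""
    rng = random.Random(seed)
    accs = []
    for _ in range(repeats):
        for fold in stratified_folds(labels, k, rng):
            test = set(fold)
            train = [i for i in range(len(labels)) if i not in test]
            preds = classify(train, fold)
            accs.append(mean(int(p == labels[i]) for i, p in zip(fold, preds)))
    return mean(accs), pvariance(accs)
```

With 10 repeats of 10 folds, this yields 100 accuracy values, whose mean and variance correspond to the numbers reported in Tables 2 and 3.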
5 Concluding remarks
This paper introduces a new weighted kernel for persistence images (WKPI), together with a metric-learning framework to learn the best weight function for the WKPI kernel from labelled data. Various properties of the kernel and the formulation of the optimization problem are provided. Importantly, we apply the learned WKPI kernel to the task of graph classification, and show that our new framework achieves similar or better results than the best results among a range of previous graph classification approaches.
In our current framework, only a single descriptor function on each input object (e.g., a graph) is used to derive a persistence-based representation. It will be interesting to extend our framework to leverage multiple descriptor functions (so as to capture different types of information) simultaneously and effectively. Recent work on multidimensional persistence would be useful in this effort. Another important question is how to incorporate categorical attributes associated with graph nodes (or points in input objects) effectively. Indeed, real-valued attributes can potentially be used as a descriptor function to generate persistence-based summaries. But handling categorical attributes via topological summarization is much more challenging, especially when there is no (prior-known) correlation between these attributes (e.g., when the attribute is simply an index into a set of categories, and the indices carry no meaning).
Acknowledgement.
The authors would like to thank Chao Chen and Justin Eldridge for useful discussions related to this project. We would also like to thank Giorgio Ascoli for helping provide the neuron dataset.
References
 [1] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier. Persistence images: a stable vector representation of persistent homology. Journal of Machine Learning Research, 18:218–252, 2017.
 [2] L. Bai, L. Rossi, A. Torsello, and E. R. Hancock. A quantum Jensen-Shannon graph kernel for unattributed graphs. Pattern Recognition, 48(2):344–355, 2015.
 [3] U. Bauer. Ripser, 2016.
 [4] U. Bauer, M. Kerber, J. Reininghaus, and H. Wagner. Phat – persistent homology algorithms toolbox. In H. Hong and C. Yap, editors, Mathematical Software – ICMS 2014, pages 137–143, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.
 [5] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
 [6] P. Bubenik. Statistical topological data analysis using persistence landscapes. The Journal of Machine Learning Research, 16(1):77–102, 2015.
 [7] G. Carlsson and V. de Silva. Zigzag persistence. Foundations of Computational Mathematics, 10(4):367–405, 2010.
 [8] G. Carlsson, V. de Silva, and D. Morozov. Zigzag persistent homology and realvalued functions. In Proc. 25th Annu. ACM Sympos. Comput. Geom., pages 247–256, 2009.
 [9] G. Carlsson and A. Zomorodian. The theory of multidimensional persistence. Discrete & Computational Geometry, 42(1):71–93, 2009.
 [10] M. Carrière, M. Cuturi, and S. Oudot. Sliced Wasserstein kernel for persistence diagrams. International Conference on Machine Learning, pages 664–673, 2017.
 [11] F. Chazal, D. Cohen-Steiner, M. Glisse, L. J. Guibas, and S. Oudot. Proximity of persistence modules and their diagrams. In Proc. 25th ACM Sympos. on Comput. Geom., pages 237–246, 2009.
 [12] F. Chazal, V. de Silva, M. Glisse, and S. Oudot. The structure and stability of persistence modules. SpringerBriefs in Mathematics. Springer, 2016.
 [13] C. Maria, J.-D. Boissonnat, M. Glisse, and M. Yvinec. The Gudhi library: simplicial complexes and persistent homology, 2014. URL: http://gudhi.gforge.inria.fr/python/latest/index.html.
 [14] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.
 [15] D. Cohen-Steiner, H. Edelsbrunner, J. Harer, and Y. Mileyko. Lipschitz functions have Lp-stable persistence. Foundations of Computational Mathematics, 10(2):127–139, 2010.
 [16] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry, 34(2):786–797, 1991.
 [17] T. K. Dey, D. Shi, and Y. Wang. SimBa: An efficient tool for approximating Rips-filtration persistence via simplicial batch-collapse. In 24th Annual European Symposium on Algorithms (ESA 2016), volume 57 of Leibniz International Proceedings in Informatics (LIPIcs), pages 35:1–35:16, 2016.
 [18] P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from nonenzymes without alignments. Journal of molecular biology, 330(4):771–783, 2003.
 [19] H. Edelsbrunner and J. Harer. Computational Topology : an Introduction. American Mathematical Society, 2010.
 [20] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. Discrete Comput. Geom., 28:511–533, 2002.
 [21] C. Helma, R. D. King, S. Kramer, and A. Srinivasan. The predictive toxicology challenge 2000–2001. Bioinformatics, 17(1):107–108, 2001.
 [22] S. Hido and H. Kashima. A lineartime graph kernel. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on, pages 179–188. IEEE, 2009.
 [23] C. Hofer, R. Kwitt, M. Niethammer, and A. Uhl. Deep learning with topological signatures. In Advances in Neural Information Processing Systems, pages 1634–1644, 2017.
 [24] M. Kerber, D. Morozov, and A. Nigmetov. Geometry helps to compare persistence diagrams. J. Exp. Algorithmics, 22:1.4:1–1.4:20, Sept. 2017.
 [25] M. Kerber, D. Morozov, and A. Nigmetov. HERA: software to compute distances for persistence diagrams, 2018. URL: https://bitbucket.org/greynarn/hera.
 [26] M. Kerber and H. Schreiber. Barcodes of towers and a streaming algorithm for persistent homology. In 33rd International Symposium on Computational Geometry (SoCG 2017), page 57. Schloss Dagstuhl-Leibniz-Zentrum für Informatik GmbH, 2017.
 [27] N. M. Kriege, P.-L. Giscard, and R. C. Wilson. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pages 1623–1631, 2016.
 [28] G. Kusano, K. Fukumizu, and Y. Hiraoka. Kernel method for persistence diagrams via kernel embedding and weight factor. Journal of Machine Learning Research, 18(189):1–41, 2018.
 [29] T. Le and M. Yamada. Persistence Fisher kernel: A Riemannian manifold kernel for persistence diagrams. In Advances in Neural Information Processing Systems (NIPS), pages 10028–10039, 2018.
 [30] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein. CayleyNets: Graph convolutional neural networks with complex rational spectral filters. IEEE Trans. Signal Processing, 67(1):97–109, 2019.
 [31] Y. Li, D. Wang, G. A. Ascoli, P. Mitra, and Y. Wang. Metrics for comparing neuronal tree shapes based on persistent homology. PloS one, 12(8):e0182184, 2017.
 [32] Y. Lin, L. Lu, and S.T. Yau. Ricci curvature of graphs. Tohoku Mathematical Journal, Second Series, 63(4):605–627, 2011.
 [33] H. Markram, E. Muller, S. Ramaswamy, M. W. Reimann, M. Abdellah, C. A. Sanchez, A. Ailamaki, L. Alonso-Nanclares, N. Antille, S. Arsever, et al. Reconstruction and simulation of neocortical microcircuitry. Cell, 163(2):456–492, 2015.
 [34] M. Neumann, N. Patricia, R. Garnett, and K. Kersting. Efficient graph kernels by randomization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 378–393. Springer, 2012.
 [35] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. International conference on machine learning, pages 2014–2023, 2016.
 [36] N. Shervashidze, S. V. N. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pages 488–495, 2009.
 [37] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt. A stable multiscale kernel for topological machine learning. In Computer Vision Pattern Recognition, pages 4741–4748, 2015.
 [38] D. Sheehy. Linear-size approximations to the Vietoris-Rips filtration. In Proc. 28th Annu. Sympos. Comput. Geom., pages 239–248, 2012.
 [39] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.
 [40] S. Verma and Z.-L. Zhang. Hunt for the unique, stable, sparse and fast feature learning on graphs. In Advances in Neural Information Processing Systems, pages 88–98, 2017.
 [41] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
 [42] L. Xu, X. Jin, X. Wang, and B. Luo. A mixed Weisfeiler-Lehman graph kernel. In International Workshop on Graph-based Representations in Pattern Recognition, pages 242–251, 2015.
 [43] P. Yanardag and S. Vishwanathan. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374, 2015.
 [44] Z. Zhang, M. Wang, Y. Xiang, Y. Huang, and A. Nehorai. RetGK: Graph kernels based on return probabilities of random walks. In Advances in Neural Information Processing Systems, pages 3968–3978, 2018.
Appendix A Missing details from Section 3
A.1 Proof of Theorem 3.2
Consider an arbitrary collection of persistence images PI_1, …, PI_n (i.e., a collection of vectors). Set G to be the n × n kernel matrix where G[i][j] = k_w(PI_i, PI_j). Now given any vector c = (c_1, …, c_n), we have that:
Σ_{i,j} c_i c_j G[i][j] = Σ_p w(p) ( Σ_{i,j} c_i c_j e^{−(PI_i(p) − PI_j(p))² / (2σ²)} ).
Because the (one-dimensional) Gaussian kernel is positive semi-definite and the weight function w is non-negative, the inner sum is non-negative for every pixel p. Hence the WKPI kernel is positive semi-definite.
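As a quick numerical illustration of this argument (a sketch assuming the per-pixel weighted-Gaussian form of the WKPI kernel from Definition 3.1, with pixel weights w(p) ≥ 0), one can verify that the quadratic form c^T G c is never negative:

```python
import math
import random

def wkpi_kernel(a, b, w, sigma=1.0):
    """Weighted kernel between two persistence images a, b (lists of pixel
    values), with non-negative pixel weights w: a sum of 1-D Gaussian kernels."""
    return sum(wp * math.exp(-(x - y) ** 2 / (2 * sigma ** 2))
               for wp, x, y in zip(w, a, b))

def quadratic_form(images, w, coeffs):
    """c^T G c for the kernel matrix G[i][j] = k_w(images[i], images[j])."""
    return sum(ci * cj * wkpi_kernel(ai, aj, w)
               for ci, ai in zip(coeffs, images)
               for cj, aj in zip(coeffs, images))

rng = random.Random(0)
images = [[rng.random() for _ in range(16)] for _ in range(8)]
w = [rng.random() for _ in range(16)]            # non-negative weights
coeffs = [rng.uniform(-1, 1) for _ in range(8)]
assert quadratic_form(images, w, coeffs) >= 0    # PSD: never negative
```

Since each pixel contributes a non-negative multiple of a positive semi-definite Gaussian kernel, the quadratic form stays non-negative for any choice of images, weights, and coefficients.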
A.2 Proof of Theorem 3.4
By Definitions 3.1 and 3.3, combined with the fact that 1 − e^{−x} ≤ x for any x ≥ 0, we have that:
Furthermore, by Theorem 10 of [1], when the distribution assigned to each persistence point in Definition 2.1 is the normalized Gaussian and the weight function satisfies the conditions of [1], the resulting persistence images are stable with respect to the 1-Wasserstein distance between the diagrams. (Intuitively, view two persistence diagrams A and B as two (appropriate) measures; the 1-Wasserstein distance is then the "earth-mover" distance required to convert the measure corresponding to A to that for B, where the cost is measured by the total distance that all mass has to travel.) Combining this with the inequalities above, the theorem then follows.
A.3 Proof of Theorem 3.6
We first show the following properties of matrix which will be useful for the proof later.
Lemma A.1
The matrix L is symmetric and positive semi-definite. Furthermore, for every vector, we have:
(2) 
Proof.
By construction, it is easy to see that L is symmetric, as the matrices it is composed of are. The positive semi-definiteness follows from Eqn (2), which we prove now.
The lemma then follows. ∎
We now prove the statement in Theorem 3.6. Recall the definitions of the various matrices, and that the vectors appearing below are the row vectors of the corresponding matrix. For simplicity, in the derivations below, we use a shorthand for the induced WKPI-distance between two persistence diagrams. Applying Lemma A.1, we have:
(3)  
Now by definition of , it is nonzero only when . Combined with Eqn (3), it then follows that:
This proves the first statement in Theorem 3.6. We now show that the matrix in question is the identity matrix I. Specifically, first consider an off-diagonal entry; we claim:
It equals 0 because the first factor is nonzero only for indices in one set, while the second factor is nonzero only for indices in a disjoint set. However, for such a pair of indices the two conditions cannot hold simultaneously, which means the corresponding term vanishes. Hence the sum is 0 for all such pairs of indices.
Now for the diagonal entries of the matrix, we have, for any diagonal index:
This finishes the proof that the matrix equals the identity I, and completes the proof of Theorem 3.6.
Appendix B More details for Experiments
B.1 Description of Neuron datasets
Neuron cells have a natural tree morphology (see Figure 4 (a) for an example), rooted at the cell body (soma), with dendrites and axons branching out. Furthermore, this tree morphology is important in understanding neurons. Hence it is common in the field of neuroscience to model a neuron as a (geometric) tree (see Figure 4 (b) for an example downloaded from NeuroMorpho.Org).
Our NeuBin dataset consists of 1126 neuron trees classified into two (primary) classes: interneurons and principal neurons (data partly from the Blue Brain Project [33] and downloaded from http://neuromorpho.org/). The second NeuMulti dataset is a refinement of the 459 trees in the interneuron class into four (secondary) classes: basket-large, basket-nest, neurogliaform, and Martinotti.
Dataset       #classes  #graphs  average #nodes  average #edges
NCI1          2         4110     29.87           32.30
NCI109        2         4127     29.68           31.96
PTC           2         344      14.29           14.69
PROTEIN       2         1113     39.06           72.82
DD            2         1178     284.32          715.66
IMDB-BINARY   2         1000     19.77           96.53
IMDB-MULTI    3         1500     13.00           65.94
REDDIT-5K     5         4999     508.82          594.87
REDDIT-12K    11        12929    391.41          456.89
B.2 Setup for persistence images
For each dataset, the persistence image for each object inside is computed within the rectangular bounding box of the points from all persistence diagrams of the input objects. The x-direction is then discretized into uniform intervals, while the y-direction is discretized accordingly so that each pixel is a square. For the persistence image (PI) approach of [1], we show results both for the unweighted persistence images (PI-CONST), and for one, denoted by PI-PL, where the weight function (for Definition 2.1) is the following piecewise-linear function (modified from one proposed by Adams et al. [1]), in which p_max denotes the largest persistence of any persistent point among all persistence diagrams:
w(u = (b, d)) = 0 if d − b ≤ 0;  (d − b)/p_max if 0 < d − b < p_max;  1 otherwise.   (4)
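To make this setup concrete, here is a minimal sketch of such a persistence-image rasterization with a piecewise-linear weight, in the spirit of Eqn (4). The grid conventions (padding, pixel centers) are assumptions of this illustration, not the exact implementation used in the experiments.

```python
import math

def pl_weight(pers, p_max):
    """Piecewise-linear weight: ramps from 0 at persistence 0 up to 1 at p_max."""
    if pers <= 0:
        return 0.0
    return min(pers / p_max, 1.0)

def persistence_image(diagram, k=20, sigma=0.1):
    """Rasterize a diagram (list of (birth, death) pairs with death >= birth)
    on a grid of square pixels: k columns in the birth direction, and as many
    rows as needed in the persistence direction."""
    pts = [(b, d - b) for b, d in diagram]       # (birth, persistence)
    p_max = max(p for _, p in pts)
    xs = [b for b, _ in pts]
    x_lo, x_hi = min(xs) - sigma, max(xs) + sigma   # pad the bounding box
    step = (x_hi - x_lo) / k                        # square pixels
    rows = max(1, int(math.ceil((p_max + sigma) / step)))
    img = [[0.0] * k for _ in range(rows)]
    for b, p in pts:
        w = pl_weight(p, p_max)
        for i in range(rows):
            for j in range(k):
                cx = x_lo + (j + 0.5) * step        # pixel center, x
                cy = (i + 0.5) * step               # pixel center, y
                img[i][j] += w * math.exp(
                    -((cx - b) ** 2 + (cy - p) ** 2) / (2 * sigma ** 2))
    return img
```

Each diagram point contributes a Gaussian bump centered at its (birth, persistence) coordinates, scaled by the piecewise-linear weight, so low-persistence points contribute little to the image.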
(Column groups: PWGK, PI-CONST, PI-PL and SW are existing TDA approaches; train-PWGK and alt-WKPI are alternative metric-learning baselines; WKPI-kM and WKPI-kC are our WKPI framework.)
Datasets      PWGK   PI-CONST  PI-PL  SW     train-PWGK  alt-WKPI  WKPI-kM  WKPI-kC
NCI1          73.3   72.5      72.1   80.1   76.5        77.4      87.2     84.7
NCI109        71.5   74.3      73.1   75.5   77.2        81.2      85.6     87.3
PTC           62.2   61.3      64.2   64.5   62.5        64.2      63.1     67.1
PROTEIN       73.6   72.2      69.1   76.4   74.8        75.1      78.8     74.9
DD            75.2   74.2      76.8   78.9   76.4        72.5      82.0     80.3
MUTAG         82.0   85.2      83.5   87.1   86.4        88.5      86.9     87.5
IMDB-BINARY   66.8   65.5      69.7   69.6   71.8        67.3      70.7     75.4
IMDB-MULTI    43.4   42.5      46.4   48.7   45.8        45.3      46.4     49.5
REDDIT-5K     47.6   52.2      51.7   53.8   53.5        54.7      58.5     60.2
REDDIT-12K    38.5   43.3      45.7   48.3   43.7        42.1      47.7     48.6
Average       63.41  64.3      65.23  68.29  66.86       66.83     70.62    71.65
B.3 Benchmark datasets for graph classification
Below we first give a brief description of the benchmark datasets we used in our experiments. These are collected from the literature.
NCI1 and NCI109 [39] consist of two balanced subsets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines, respectively.
PTC [21] is a dataset of graph structures of chemical molecules tested on rats and mice, designed for the predictive toxicology challenge 2000–2001.
DD [18] is a dataset of 1178 protein structures. Each protein is represented by a graph, in which the nodes are amino acids and two nodes are connected by an edge if they are less than 6 Angstroms apart. The graphs are classified according to whether the proteins are enzymes or not.
PROTEINS [5] contains graphs of proteins. In each graph, a node represents a secondary structure element (SSE) within the protein structure, i.e., helices, sheets and turns. Edges connect nodes if they are neighbours along the amino acid sequence or neighbours in protein structure space. Every node is connected to its three nearest spatial neighbours.
MUTAG [16] is a dataset of 188 mutagenic aromatic and heteroaromatic nitro compounds, labelled according to whether they have a mutagenic effect on the Gram-negative bacterium Salmonella typhimurium.
REDDIT-5K and REDDIT-12K [43] consist of graphs representing discussions on the online forum Reddit. In these datasets, nodes represent users, and an edge between two nodes indicates that one of the two users commented on the other's posts. In REDDIT-5K, graphs are collected from 5 subforums and are labelled by the subforum they belong to. In REDDIT-12K, there are 11 subforums involved, and the labels are assigned similarly.
IMDB-BINARY and IMDB-MULTI [43] are datasets of collaboration networks of actors and actresses who played roles in movies in IMDB. In each graph, a node represents an actor or actress, and an edge connects two nodes when they appear in the same movie. In IMDB-BINARY, graphs are classified into the Action and Romance genres. In IMDB-MULTI, they are collected from three genres: Comedy, Romance and Sci-Fi.
The statistics of these datasets are provided in the dataset-statistics table earlier in this appendix.
B.4 Topological-based methods on graph data
Here we compare our WKPI framework with several state-of-the-art persistence-based classification frameworks: PWGK [28], SW [10], and PI [1]. We also compare it with two alternative ways to learn a metric for persistence-based representations: train-PWGK is the version of PWGK [28] in which we learn the weight function in its formulation, using the same cost function as the one we propose in this paper for our WKPI kernel; alt-WKPI is an alternative kernel formulation for persistence images, in which the kernel takes a different form from our WKPI kernel as defined in Definition 3.1. We use the same setup as for our WKPI framework to train these two metrics, and use the resulting kernels in an SVM to classify the benchmark graph datasets. The WKPI framework outperforms the existing approaches and the alternative metric-learning methods on all datasets except MUTAG, with WKPI-kM (i.e., WKPI-kmeans) and WKPI-kC (i.e., WKPI-kcenter) achieving the two highest average accuracies. The classification accuracies of all these methods are reported in the accuracy table earlier in this appendix.