Comparing Graphs via Persistence Distortion

Tamal Dey, Dayu Shi, Yusu Wang
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA. Emails: tamaldey, shiday, yusu@cse.ohio-state.edu

Abstract

Metric graphs are ubiquitous in science and engineering. For example, many data are drawn from hidden spaces that are graph-like, such as the cosmic web. A metric graph offers one of the simplest yet still meaningful ways to represent the non-linear structure hidden behind the data. In this paper, we propose a new distance between two finite metric graphs, called the persistence-distortion distance, which draws upon a topological idea. This topological perspective, along with the metric space viewpoint, provides a new angle on the graph matching problem. Our persistence-distortion distance has two properties not shared by previous methods. First, it is stable against perturbations of the input graph metrics. Second, it is a continuous distance measure, in the sense that it is defined on an alignment of the underlying spaces of the input graphs, instead of merely their nodes. This makes our persistence-distortion distance robust against, for example, different discretizations of the same underlying graph.

Despite considering the input graphs as continuous spaces, that is, taking all points into account, we show that we can compute the persistence-distortion distance in polynomial time. The discrete case, where only graph nodes are considered, can be computed much faster. We also provide some preliminary experimental results to demonstrate the use of the new distance measure.

1 Introduction

Many data in science and engineering are drawn from hidden spaces which are graph-like, such as the cosmic web [30] and road networks [2, 8]. Furthermore, as modern data become increasingly complex, understanding them with a simple yet still meaningful structure becomes important. Metric graphs equipped with a metric derived from the data can provide such a simple structure [19, 29]. They are graphs where each edge is associated with a length inducing the metric of shortest path distance. The comparison of the representative metric graphs can benefit classification of data, a fundamental task in processing them. This motivates the study of metric graphs in the context of matching or comparison.

To compare two objects, one needs a notion of distance in the space the objects come from. Various distance measures for graphs have been proposed in the literature with associated matching algorithms. We approach this problem from two new perspectives: (i) We aim to develop a distance measure that is both meaningful and stable against metric perturbations, and at the same time amenable to polynomial-time computation. (ii) Unlike most previous distance measures, which are discrete in the sense that only graph node alignments are considered, we aim for a distance measure that is continuous; that is, alignments of all points in the underlying spaces of the metric graphs are considered.

Related work.

To date, the many proposed graph matching algorithms fall into two broad categories: exact graph matching methods and inexact graph matching methods (distances between graphs). Exact graph matching, also called the graph isomorphism problem, checks whether there is a bijection between the node sets of two input graphs that also induces a bijection between their edge sets. While polynomial-time algorithms exist for many special cases, e.g., [4, 22, 26], it is not known whether there exists a polynomial-time algorithm for the graph isomorphism problem on general graphs (despite the ground-breaking recent work by Babai showing that it can be solved in quasi-polynomial time [5]). Nevertheless, given the importance of this problem, various exact graph matching algorithms have been developed in practice. Usually, these methods employ pruning techniques to reduce the search space for identifying graph isomorphisms. See [17] for comparisons of various graph isomorphism testing methods.

In real-world applications, input graphs often suffer from noise and deformation, and it is highly desirable to obtain a distance between two input graphs beyond the binary decision of whether they are the same (isomorphic) or not. This is referred to as inexact graph matching in the field of pattern recognition, and various distance measures have been proposed. One line of work is based on the graph edit distance, which is NP-hard to compute [34]. Many heuristic methods, using for example A* algorithms, have been proposed to address the issue of high computational complexity; see the survey [18] and references within. One of the main challenges in comparing two graphs is to determine how "good" a given alignment of graph nodes is in terms of the quality of the pairwise relations between those nodes. Hence matching two graphs naturally leads to an integer quadratic programming problem (IQP), which is an NP-hard problem. Several heuristic methods have been proposed to approach this optimization problem, such as the annealing approach of [20], the iterative methods of [25, 32], and the probabilistic approach of [33]. Finally, there have been several methods that formulate the optimization problem based on spectral properties of graphs. For example, in [31], the author uses the eigendecomposition of the adjacency matrices of the input graphs to derive an expression for an orthogonal matrix which optimizes the objective function. In [12, 24], the principal eigenvector of a "compatibility" matrix of the input graphs is used to obtain correspondences between input graph nodes. Recently, in [23], Hu et al. proposed the general and descriptive Laplacian family signatures to build the compatibility matrix and model the graph matching problem as an integer quadratic program.

New work.

Unlike previous approaches, we view input graphs as continuous metric spaces. Intuitively, we assume that our input is a finite graph $G = (V, E)$ where each edge is assigned a positive length value. We now consider $G$ as a metric space $(|G|, d_G)$ on the underlying space $|G|$ of $G$, with the metric $d_G$ being the shortest path metric in $|G|$. Given two metric graphs $G_1$ and $G_2$, a natural way to define their distance is to use the so-called Gromov-Hausdorff distance [21, 27], which measures the metric distortion between these two metric spaces. Unfortunately, it is NP-hard to even approximate the Gromov-Hausdorff distance for graphs within a constant factor [3]. Instead, we propose a new metric, called the persistence-distortion distance $d_{PD}(G_1, G_2)$, which draws upon a topological idea and is computable in polynomial time with techniques from computational geometry. This provides a new angle on the graph comparison problem. The distance that we define has several nice properties:

(1) The persistence-distortion distance takes into account all points in the geometric realization of the input graphs, while all previous graph matching algorithms align only graph nodes. Hence our persistence-distortion distance is insensitive to different discretizations of the same graph: for example, the two geometric graphs on the right are equivalent as metric graphs, and thus the persistence-distortion distance between them is zero.

(2) In Section 3, we show that the persistence-distortion distance is stable w.r.t. changes to input metric graphs as measured by the Gromov-Hausdorff distance. For example, the two geometric graphs on the right have small persistence-distortion distance. (Imagine that they are the reconstructed road networks from noisy data sampled from the same road systems.)

(3) Although the persistence-distortion distance is a continuous measure which considers all points in a geometric realization of the input graphs, we show in Section 5 that it can be computed in time polynomial in $m$, where $m$ is the total number of nodes and edges of the input graphs. We note that the discrete version of the persistence-distortion distance, where only graph nodes are considered (much like in previous graph matching algorithms), can be computed much more efficiently, in $O(n^2 m^{1.5} \log m)$ time, where $n$ is the number of graph nodes in the input graphs.

Finally, we also provide some preliminary experimental results to demonstrate the use of the persistence-distortion distance.

2 Notations and Proposed Distance Measure for Graphs

Metric graphs.

A metric graph is a metric space $(M, d)$ where $M$ is the underlying space of a finite 1-dimensional simplicial complex. Given a graph $G = (V, E)$ and a weight function $\mathrm{Len}: E \to \mathbb{R}^+$ on its edge set $E$ (assigning lengths to edges in $E$), we can associate a metric graph $(|G|, d_G)$ to it as follows. The space $|G|$ is a geometric realization of $G$. Let $|e|$ denote the image of an edge $e \in E$ in $|G|$. To define the metric $d_G$, we consider the arclength parameterization $e: [0, \mathrm{Len}(e)] \to |e|$ for every edge $e \in E$ and define the distance between any two points $x, y \in |e|$ as $d_G(x, y) = |e^{-1}(y) - e^{-1}(x)|$. This in turn provides the length of a path $\pi(z, w)$ between two points $z, w \in |G|$ that are not necessarily on the same edge in $|G|$, by simply summing up the lengths of the restrictions of this path to edges in $G$. Finally, given any two points $z, w \in |G|$, the distance $d_G(z, w)$ is given by the minimum length of any path connecting $z$ and $w$ in $|G|$.
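
As a small illustration of the shortest-path metric restricted to graph nodes, the following Python sketch runs Dijkstra's algorithm on an edge-weighted graph. The function names `build_adjacency` and `shortest_path_lengths` and the edge-list input format are assumptions made for this example; they do not come from the paper. The distance to a point in the interior of an edge can be obtained the same way after splitting that edge at the point.

```python
import heapq


def build_adjacency(edges):
    """Turn a list of (u, v, length) triples into an undirected adjacency map."""
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, []).append((v, length))
        adj.setdefault(v, []).append((u, length))
    return adj


def shortest_path_lengths(adj, source):
    """Dijkstra: shortest-path distance from `source` to every reachable node.

    `adj` maps node -> list of (neighbor, edge_length), with positive lengths.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, length in adj[u]:
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Triangle in which the direct edge a-c is longer than the detour through b.
adj = build_adjacency([("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 3.0)])
dist_from_a = shortest_path_lengths(adj, "a")
```

In this triangle the metric assigns distance 2 to the pair (a, c), since the path through b is shorter than the edge of length 3.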

In what follows, we do not distinguish between $|\cdot|$ and its argument, and write $(G, d_G)$ to denote the metric graph $(|G|, d_G)$ for simplicity. Furthermore, for simplicity of presentation, we abuse notation slightly and refer to the metric graph as $G = (V, E)$, with the understanding that $(V, E)$ refers to the graph representing the metric space $(G, d_G)$. Finally, we refer to any point $x \in G$ (i.e., $x \in |G|$) as a point, and to a point $x \in V$ as a graph node.

Background on persistent homology.

The definition of our proposed distance measure for two metric graphs relies on the so-called persistence diagram induced by a scalar function. We refer the readers to resources such as [14, 15] for formal discussions on persistent homology and related developments. Below we only provide an intuitive and informal description of the persistent homology induced by a function under our simple setting.

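Let $f: X \to \mathbb{R}$ be a continuous real-valued function defined on a topological space $X$. We want to understand the structure of $X$ from the perspective of the scalar function $f$. Specifically, let $X^{\alpha} := \{x \in X \mid f(x) \ge \alpha\}$ denote the super-level set of $X$ w.r.t. $\alpha \in \mathbb{R}$ (see Footnote 1). As we sweep $X$ top-down by decreasing the value $\alpha$, the sequence of super-level sets, equipped with natural inclusion maps, gives rise to a filtration of $X$ induced by $f$:

$$X^{\alpha_1} \subseteq X^{\alpha_2} \subseteq \cdots \subseteq X^{\alpha_m} = X, \quad \text{for } \alpha_1 > \alpha_2 > \cdots > \alpha_m. \qquad (1)$$

Footnote 1: In the standard formulation of persistent homology of a scalar field, the sub-level set is often used. We use super-level sets, which suit the specific functions that we use.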

We track how the topological features captured by the so-called homology classes of the super-level sets change. In particular, as $\alpha$ decreases, sometimes new topological features are "born" at time $\alpha$; that is, new families of homology classes are created in $H_k(X^{\alpha})$, the $k$-th homology group of $X^{\alpha}$. Sometimes, existing topological features disappear, i.e., some homology classes become trivial in $H_k(X^{\beta})$ for some $\beta < \alpha$. Persistent homology captures such birth and death events, and summarizes them in the persistence diagram $\mathrm{Dg}_k(f)$. Specifically, $\mathrm{Dg}_k(f)$ consists of a set of points $\{(\alpha, \beta) \in \mathbb{R}^2\}$ in the plane, where each $(\alpha, \beta)$ indicates a homological feature created at time $\alpha$ and killed entering time $\beta$.

In our setting, the domain $X$ will be the underlying space of a metric graph $(G, d_G)$. The specific function that we use later is the geodesic distance to a fixed basepoint $s \in G$; that is, we consider $f_s: G \to \mathbb{R}$ where $f_s(x) = d_G(s, x)$ for any $x \in G$. We are only interested in the 0-th dimensional persistent homology ($k = 0$ in the above description), which simply tracks the connected components in the super-level set as we decrease $\alpha$.

Figure 1: (a) A graph with basepoint $s$; the length of each edge is marked. (b) The function $f_s$, with critical-pairs indicated. (c) The persistence diagram of $f_s$; each persistence-point is generated by a critical-pair. (d) A partial matching between the red points and blue points (representing two persistence diagrams). Some points are matched to the diagonal.

Figure 1 gives an example of the 0-th persistence diagram with the basepoint $s$ lying on an edge of the graph. As we sweep the graph top-down in terms of the geodesic function $f_s$, a new connected component is created as we pass through a local maximum of $f_s$. A local maximum of $f_s$, as illustrated in Figure 1 (b), is not necessarily a graph node from $V$. Two connected components in the super-level set can only merge at an up-fork saddle of $f_s$: an up-fork saddle $u$ is a point that has a neighborhood with at least two branches incident on $u$ whose function values are larger than $f_s(u)$. Each point $(b, d)$ in the persistence diagram is called a persistence point, corresponding to the creation and death of some connected component: at time $b$, a new component is created in the super-level set at a local maximum $u_b \in G$ with $f_s(u_b) = b$. At time $d$, at an up-fork saddle $u_d \in G$ with $f_s(u_d) = d$, this component merges with another component created earlier. We refer to the pair of points $(u_b, u_d)$ from the graph $G$ as the critical-pair corresponding to the persistence point $(b, d)$. We call $b$ and $d$ the birth-time and death-time, respectively. The plane containing the persistence diagram is called the birth-death plane.
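
The sweep described above can be sketched with a union-find structure. The Python code below is an illustrative sketch rather than the authors' implementation: it assumes the function is given by its values at graph nodes and is monotone along each edge, so any local maximum in the interior of an edge must first be added as an extra node. The name `superlevel_persistence_0` and the choice to report the component born at the global maximum separately (as `essential_birth`) are assumptions of this example.

```python
def superlevel_persistence_0(values, edges):
    """0-dimensional super-level set persistence of a function on a graph.

    `values` maps node -> function value; `edges` is an iterable of (u, v).
    Returns (pairs, essential_birth): `pairs` lists (birth, death) points with
    birth > death, and `essential_birth` is the global maximum value.
    """
    adj = {v: [] for v in values}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    parent = {}  # union-find forest over nodes already swept
    birth = {}   # birth value stored at each component root

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    pairs = []
    # Sweep top-down: process nodes in order of decreasing function value.
    for v in sorted(values, key=values.get, reverse=True):
        parent[v] = v
        birth[v] = values[v]
        for w in adj[v]:
            if w not in parent:
                continue  # neighbor lies below the current level
            rv, rw = find(v), find(w)
            if rv == rw:
                continue
            # Elder rule: the component born later (smaller birth) dies here.
            young, old = (rv, rw) if birth[rv] < birth[rw] else (rw, rv)
            if birth[young] > values[v]:
                pairs.append((birth[young], values[v]))
            parent[young] = old
    return sorted(pairs, reverse=True), max(values.values())


# Star graph: center c joined to leaves a, b, d by edges of length 1, 2, 3.
# With basepoint a, the geodesic distance values are:
values = {"a": 0, "c": 1, "b": 3, "d": 4}
edges = [("a", "c"), ("c", "b"), ("c", "d")]
pairs, essential_birth = superlevel_persistence_0(values, edges)
```

In this example the leaves b and d are local maxima, and the two components they create merge at the degree-3 node c, which acts as the up-fork saddle.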

Finally, given two finite persistence diagrams $P$ and $Q$, a common distance measure for them, the bottleneck distance $d_B(P, Q)$ [9], is defined as follows: Consider $P$ and $Q$ as two finite sets of points in the plane (where points may overlap). Call $L = \{(x, x) \mid x \in \mathbb{R}\}$ the diagonal of the birth-death plane.

Definition 1

A partial matching $C$ of $P$ and $Q$ is a relation $C \subseteq (P \cup L) \times (Q \cup L)$ such that each point in $P$ is either matched to a unique point in $Q$, or mapped to its closest point (under the $L_\infty$-norm) in the diagonal $L$; and the same holds for points in $Q$. See Figure 1 (d). The bottleneck distance is defined as $d_B(P, Q) = \min_{C} \max_{(p, q) \in C} \|p - q\|_\infty$, where $C$ ranges over all possible partial matchings of $P$ and $Q$. We call a partial matching that achieves the bottleneck distance a bottleneck matching.
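
To make the definition concrete, here is a minimal Python sketch that computes the bottleneck distance for small diagrams. It is not the geometric algorithm cited later in the paper: it assumes a plain threshold search combined with augmenting-path bipartite matching, which is simple but slower. The function name `bottleneck_distance` and the use of extra "diagonal slots" to model unmatched points are choices made for this illustration.

```python
def bottleneck_distance(P, Q):
    """Bottleneck distance between two finite diagrams (lists of 2D points).

    Rows are the points of P followed by len(Q) diagonal slots; columns are the
    points of Q followed by len(P) diagonal slots. Matching a point to a slot
    costs its L-infinity distance to the diagonal; slot-to-slot costs 0.
    """
    n, m = len(P), len(Q)
    size = n + m
    if size == 0:
        return 0.0

    def linf(p, q):
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    def to_diagonal(p):
        return abs(p[0] - p[1]) / 2.0

    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                cost[i][j] = linf(P[i], Q[j])
            elif i < n:
                cost[i][j] = to_diagonal(P[i])
            elif j < m:
                cost[i][j] = to_diagonal(Q[j])

    def has_perfect_matching(threshold):
        match_col = [-1] * size  # row currently matched to each column

        def augment(i, seen):
            for j in range(size):
                if cost[i][j] <= threshold and not seen[j]:
                    seen[j] = True
                    if match_col[j] == -1 or augment(match_col[j], seen):
                        match_col[j] = i
                        return True
            return False

        return all(augment(i, [False] * size) for i in range(size))

    # The optimum is one of the pairwise costs; binary search for the smallest
    # threshold that still admits a perfect matching.
    thresholds = sorted({c for row in cost for c in row})
    lo, hi = 0, len(thresholds) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if has_perfect_matching(thresholds[mid]):
            hi = mid
        else:
            lo = mid + 1
    return thresholds[lo]


example = bottleneck_distance([(5.0, 1.0), (2.0, 1.8)], [(5.0, 1.5)])
```

In the example, the point (5, 1) is matched to (5, 1.5) at cost 0.5, while the low-persistence point (2, 1.8) is matched to the diagonal at a smaller cost, so the bottleneck value is determined by the first pair.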

Proposed persistence-distortion distance for metric graphs.

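Suppose we are given two metric graphs $(G_1, d_{G_1})$ and $(G_2, d_{G_2})$. Let $V(G_i)$ and $E(G_i)$ denote the node set and edge set of $G_i$, for $i = 1, 2$, respectively. Set $n = \max\{|V(G_1)|, |V(G_2)|\}$ and $m = \max\{|E(G_1)|, |E(G_2)|\}$.

Choose any point $s \in G_1$ as the base point, and consider the shortest path distance function $f_s: G_1 \to \mathbb{R}$ defined as $f_s(x) = d_{G_1}(s, x)$ for any point $x \in G_1$. Let $P_s$ denote the 0-th dimensional persistence diagram induced by the function $f_s$. Define $g_t: G_2 \to \mathbb{R}$ and $Q_t$ similarly for any base point $t \in G_2$ for the graph $G_2$. We map the graph $G_1$ to the set of (infinitely many) points in the space of persistence diagrams given by $\mathcal{C} := \{P_s \mid s \in G_1\}$. Similarly, we map the graph $G_2$ to $\mathcal{F} := \{Q_t \mid t \in G_2\}$.

Definition 2

The persistence-distortion distance between $G_1$ and $G_2$, denoted by $d_{PD}(G_1, G_2)$, is the Hausdorff distance $d_H(\mathcal{C}, \mathcal{F})$ between the two sets $\mathcal{C}$ and $\mathcal{F}$, where the distance between two persistence diagrams is measured by the bottleneck distance. In other words,

$$d_{PD}(G_1, G_2) = \max\Big\{ \sup_{P \in \mathcal{C}} \inf_{Q \in \mathcal{F}} d_B(P, Q), \; \sup_{Q \in \mathcal{F}} \inf_{P \in \mathcal{C}} d_B(P, Q) \Big\}.$$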

Remark.

(1) We note that if two graphs are isomorphic, then $d_{PD}(G_1, G_2) = 0$. The converse unfortunately is not true. (See Figure 2 for an example where two graphs have $d_{PD}(G_1, G_2) = 0$, but they are not isomorphic.) Hence $d_{PD}$ is a pseudo-metric (it inherits the triangle-inequality property from the Hausdorff distance). (2) While the above definition uses only the 0-th persistence diagram for the geodesic distance functions, all our results hold with the same time complexity when we also include the 1st-extended persistence diagram [10] or, equivalently, the 1st-interval persistence diagram [13] for each geodesic distance function $f_s$ (resp. $g_t$).

Figure 2: In the top row, we show three components $A$, $B$, and $C$ that will be used to construct the two input metric graphs $G_1$ and $G_2$ in the bottom row. All edges are of length 1. Each of these components has the property that, from a basepoint outside these components, the persistence diagram w.r.t. the geodesic distance remains the same (and hence the components cannot be distinguished). Graphs $G_1$ and $G_2$ are not isomorphic. However, they are mapped to the same image set in the space of persistence diagrams, and hence the persistence-distortion distance between them is zero; i.e., $d_{PD}(G_1, G_2) = 0$.

3 Stability of persistence-distortion distance

Gromov-Hausdorff distance.

There is a natural way to measure metric distortion between metric spaces (and thus between metric graphs), given by the Gromov-Hausdorff distance [21, 7]. Given two metric spaces $\mathcal{X} = (X, d_X)$ and $\mathcal{Y} = (Y, d_Y)$, a correspondence between $\mathcal{X}$ and $\mathcal{Y}$ is a relation $M \subseteq X \times Y$ such that (i) for any $x \in X$, there exists $(x, y) \in M$, and (ii) for any $y \in Y$, there exists $(x, y) \in M$. The Gromov-Hausdorff distance between $\mathcal{X}$ and $\mathcal{Y}$ is

$$d_{GH}(\mathcal{X}, \mathcal{Y}) = \frac{1}{2} \inf_{M} \sup_{(x_1, y_1), (x_2, y_2) \in M} |d_X(x_1, x_2) - d_Y(y_1, y_2)|, \qquad (2)$$

where $M$ ranges over all correspondences of $X \times Y$. The Gromov-Hausdorff distance is a natural distance between two metric spaces; see [27] for more discussion. Unfortunately, so far, there is no efficient (polynomial-time) algorithm to compute or approximate this distance, even for special metric spaces. In fact, it has recently been shown that even the discrete Gromov-Hausdorff distance for metric trees (where only tree nodes are considered) is NP-hard to compute, as well as to approximate within a constant factor [3]. In contrast, as we show in Sections 4 and 5, the persistence-distortion distance can be computed in polynomial time.

On the other hand, we have the following stability result, which intuitively suggests that the persistence-distortion distance is a weaker relaxation of the Gromov-Hausdorff distance. The proof of this theorem leverages a recent result on measuring distances between the Reeb graphs [6].

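Theorem 3 (Stability)

$d_{PD}(G_1, G_2) \le 6\, d_{GH}(G_1, G_2)$.

By the triangle inequality, this also implies that given two metric graphs $G_1$ and $G_2$ and their perturbations $G_1'$ and $G_2'$, respectively, we have that: $d_{PD}(G_1', G_2') \le d_{PD}(G_1, G_2) + 6\, d_{GH}(G_1, G_1') + 6\, d_{GH}(G_2, G_2')$.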

Proof of Theorem 3.

The remainder of this section is devoted to the proof of the above theorem.

Given two input metric graphs $(G_1, d_{G_1})$ and $(G_2, d_{G_2})$, set $\delta = d_{GH}(G_1, G_2)$ to be the Gromov-Hausdorff distance between $G_1$ and $G_2$. Assume that the correspondence $M^*$ achieves this metric distortion distance (see Footnote 2). Now for any point $s \in G_1$, there must exist $t \in G_2$ such that $(s, t) \in M^*$. We will now show that $d_B(P_s, Q_t) \le 6\delta$. Symmetrically, we show that for any $t \in G_2$, there is $s \in G_1$ such that $d_B(P_s, Q_t) \le 6\delta$. Since such a $t$ can be found for any point $s \in G_1$, and symmetrically, such an $s$ can be found for any $t \in G_2$, it then follows that the Hausdorff distance between $\mathcal{C}$ and $\mathcal{F}$ is bounded from above by $6\delta$, proving the theorem.

Footnote 2: It is possible that $\delta$ is achieved only in the limit, in which case we consider a sequence of $\varepsilon$-correspondences whose corresponding metric distortion distance converges to $\delta$ as $\varepsilon$ tends to 0. For simplicity, we assume that $\delta$ can be achieved by the correspondence $M^*$.

We will prove $d_B(P_s, Q_t) \le 6\delta$ for $(s, t) \in M^*$ with the help of another distance, the so-called functional-distortion distance between two Reeb graphs introduced in [6]. We recall its definition below.

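The functional-distortion distance is defined between two graphs $G_1$ and $G_2$, with a function $f: G_1 \to \mathbb{R}$ and $g: G_2 \to \mathbb{R}$ defined on each of them, respectively. (In our case, $f$ will later be taken as the shortest path distance function $f_s$, and $g$ will be taken as $g_t$.) First, we define the following (pseudo-)metrics $d_f$ and $d_g$ on the input graphs, as induced by $f$ and $g$, respectively. (It is important to note that these metrics are different from the path-length distance metrics $d_{G_1}$ and $d_{G_2}$ that the input graphs already come with.) Specifically, given two points $x, x' \in G_1$, define

$$d_f(x, x') = \min_{\pi:\, x \leadsto x'} \mathrm{height}(\pi), \qquad (3)$$

where $\pi$ ranges over all paths in $G_1$ from $x$ to $x'$, and $\mathrm{height}(\pi) = \max_{z \in \pi} f(z) - \min_{z \in \pi} f(z)$ is the maximum $f$-function value difference for points from the path $\pi$. Define the metric $d_g$ for $G_2$ similarly.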

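Now given two continuous maps $\phi: G_1 \to G_2$ and $\psi: G_2 \to G_1$, we consider the following continuous matching, which is a correspondence induced by the pair of continuous maps $\phi$ and $\psi$:

$$C(\phi, \psi) = \{(x, y) \in G_1 \times G_2 \mid \phi(x) = y \text{ or } x = \psi(y)\}. \qquad (4)$$

The distortion induced by $\phi$ and $\psi$ is defined as:

$$D(\phi, \psi) = \sup_{(x, y), (x', y') \in C(\phi, \psi)} \frac{1}{2} |d_f(x, x') - d_g(y, y')|. \qquad (5)$$

The functional-distortion distance between two metric graphs $G_1$ and $G_2$ is defined as:

$$d_{FD}(G_1, G_2) = \inf_{\phi, \psi} \max\{D(\phi, \psi),\, \|f - g \circ \phi\|_\infty,\, \|g - f \circ \psi\|_\infty\}, \qquad (6)$$

where $\phi$ and $\psi$ range over all continuous maps between $G_1$ and $G_2$. It is shown in [6] that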

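Theorem 4 ([6])

$d_B(\mathrm{Dg}_0(f), \mathrm{Dg}_0(g)) \le d_{FD}(G_1, G_2)$, where $\mathrm{Dg}_0(f)$ and $\mathrm{Dg}_0(g)$ denote the 0-th dimensional persistence diagrams induced by the functions $f$ and $g$, respectively.

In what follows, we show that $d_{FD}(G_1, G_2) \le 6\, d_{GH}(G_1, G_2)$ for $f = f_s$ and $g = g_t$. Note that in this case $\mathrm{Dg}_0(f) = P_s$ and $\mathrm{Dg}_0(g) = Q_t$. Combining with Theorem 4, this then implies $d_B(P_s, Q_t) \le 6\, d_{GH}(G_1, G_2)$.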

Remark.

We note that Theorem 4 extends to the case where we consider the 1st-extended persistence diagrams for $f$ and $g$, respectively, in which case the constant in front of $d_{FD}$ will change from 1 to 3. In other words, if we include the 1st-extended persistence diagrams in our definition of the persistence-distortion distance, then Theorem 3 still holds with a slightly worse constant of 18 (instead of 6).

Lemma 5

$d_{FD}(G_1, G_2) \le 6\, d_{GH}(G_1, G_2)$ for $f = f_s$ and $g = g_t$.

Proof.

First, we introduce the function-restricted Gromov-Hausdorff distance , which is a more restricted version of the Gromov-Hausdorff distance, defined as follows:

(7)

where ranges over all correspondences between graphs and . Compared to the definition of the Gromov-Hausdorff distance in Eqn (2), the functional Gromov-Hausdorff distance has an extra condition that the function value difference between a pair of corresponding points and should also be small.

We claim that . The proof follows almost exactly the same as the proof of Theorem A.1 of [6] (or Theorem 5.1 of the full arXiv version). Specifically, in Theorem A.1 of [6], it states that , where the so-called functional Gromov-Hausdorff distance defined as:

where ranges over all correspondences between graphs and . In other words, the difference between and our is that the metric on (resp. on ) is versus the input graph metric (resp. versus ). Nevertheless, it turns out that the proof of can be easily modified to prove . Specifically, one property that will be used many times in this modification is that for any (a symmetric statement holds for points in ). Since the proof is almost verbatim of the proof for Theorem A.1 in [6] (Theorem 5.1 of the full arXiv version), we omit it here.

Given the optimal correspondence for the Gromov-Hausdorff, it is easy to see that for any pair . For the correspondence , the other two terms in Equation 7 are both bounded by . It then follows that

Combining this with , the lemma then follows. ∎

We remark that one can in fact further modify the proof of Theorem A.1 of [6] to obtain a smaller constant in Lemma 5.

Putting everything together we obtain Theorem 3.

A corollary of Lemma 5.

We note that in the case where the metric graphs and are trees, say we have two metric trees and . If a tree, say , is associated with a function such that the function value of is monotonically decreasing from the root to any leaf, then we also refer to equipped with the function a merge tree, denoted by . (Similarly, denote by a merge tree equipped with a function .) Morozov et al. [28] introduces a so-called interleaving distance for two merge trees, denoted by . It is shown in [6] (Theorem 6.2 in the arXiv version) that , where and are induced by and as defined in Eqn (3), respectively. By Lemma 5 we then have the following result, which could be of independent interests.

Corollary 6

Given two metric trees and , let and be such that is from an optimal correspondance realizing the Gromov-Hausdorff distance between and . Consider the functions and for base points and . We then have that

4 Discrete PD-Distance

Suppose we are given two connected metric graphs $(G_1, d_{G_1})$ and $(G_2, d_{G_2})$, where the shortest-path metrics $d_{G_1}$ and $d_{G_2}$ are induced by lengths associated with the edges in $E(G_1) \cup E(G_2)$. As a warm-up, we first consider the following discrete version of the persistence-distortion distance, where only graph nodes in $V(G_1)$ and $V(G_2)$ are used as base points:

Definition 7

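Let $\hat{\mathcal{C}} := \{P_v \mid v \in V(G_1)\}$ and $\hat{\mathcal{F}} := \{Q_u \mid u \in V(G_2)\}$ be two discrete sets of persistence diagrams. The discrete persistence-distortion distance between $G_1$ and $G_2$, denoted by $\hat{d}_{PD}(G_1, G_2)$, is given by the Hausdorff distance $d_H(\hat{\mathcal{C}}, \hat{\mathcal{F}})$.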
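
Since both collections in this discrete definition are finite, the Hausdorff distance reduces to a double loop over pairs. The following Python sketch shows that reduction for any finite collections and any distance function passed in by the caller. The name `hausdorff_distance` and the callback parameter `dist` are assumptions of this example; in the setting of the paper, `dist` would be a bottleneck-distance routine and the collections would hold one persistence diagram per graph node.

```python
def hausdorff_distance(A, B, dist):
    """Hausdorff distance between finite non-empty collections A and B.

    `dist(a, b)` is any distance between an element of A and an element of B.
    """
    def directed(X, Y, d):
        # Largest distance from an element of X to its nearest element of Y.
        return max(min(d(x, y) for y in Y) for x in X)

    forward = directed(A, B, dist)
    backward = directed(B, A, lambda b, a: dist(a, b))
    return max(forward, backward)


# Toy usage with real numbers standing in for persistence diagrams.
toy = hausdorff_distance([0.0, 1.0], [0.0, 5.0], lambda a, b: abs(a - b))
```

The double loop evaluates `dist` on every pair, which matches the pairwise comparison of diagrams used in the running-time analysis of the discrete distance.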

We note that while we only consider graph nodes as base points, the local maxima of the resulting geodesic function may still occur in the middle of an edge. Nevertheless, for a fixed base point, each edge can have at most one local maximum, and its location can be decided in $O(1)$ time once the shortest-path distances from the base point to the endpoints of this edge are known. The observation below follows from the fact that the geodesic distance is 1-Lipschitz (as the basepoint moves) and from the stability of persistence diagrams.
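
The constant-time location of a local maximum on an edge can be sketched as follows. This Python snippet is an illustration under stated assumptions: the base point does not lie in the interior of the edge, `d_u` and `d_v` are the shortest-path distances from the base point to the two endpoints, and the name `edge_local_maximum` and the tolerance `tol` are choices made for this example. A point at offset $t$ from the first endpoint has distance $\min(d_u + t,\, d_v + \ell - t)$ from the base point, where $\ell$ is the edge length, so an interior maximum occurs where the two terms are equal.

```python
def edge_local_maximum(d_u, d_v, length, tol=1e-12):
    """Locate the interior local maximum of the geodesic function on one edge.

    Returns (offset_from_u, function_value), or None when the function is
    monotone along the edge (the edge lies on a shortest path).
    """
    if abs(d_u - d_v) >= length - tol:
        return None  # no interior maximum: values rise or fall monotonically
    offset = (length + d_v - d_u) / 2.0
    value = (d_u + d_v + length) / 2.0
    return offset, value


# Endpoints at distance 1 and 2 from the base point, joined by an edge of length 3.
peak = edge_local_maximum(1.0, 2.0, 3.0)
```

In the example the maximum sits at offset 2 from the first endpoint, where both routes to the base point have length 3.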

Observation 8

, where is the largest length of any edge in .

Lemma 9

Given connected metric graphs $(G_1, d_{G_1})$ and $(G_2, d_{G_2})$, $\hat{d}_{PD}(G_1, G_2)$ can be computed in $O(n^2 m^{1.5} \log m)$ time, where $n = \max\{|V(G_1)|, |V(G_2)|\}$ and $m = \max\{|E(G_1)|, |E(G_2)|\}$.

Proof.

For a given base point $s \in V(G_1)$ (or $t \in V(G_2)$), computing the shortest path distance from $s$ to all other graph nodes, as well as the persistence diagram $P_s$ (or $Q_t$), takes $O(m \log n)$ time. Hence it takes $O(mn \log n)$ total time to compute the two collections of persistence diagrams $\hat{\mathcal{C}}$ and $\hat{\mathcal{F}}$.

Each persistence diagram $P_s$ has $O(m)$ points in the plane: it is easy to show that there are $O(m)$ local maxima of the geodesic function $f_s$ (some of which may occur in the interior of graph edges). Since the birth time of every persistence point corresponds to a unique local maximum of $f_s$, there can be only $O(m)$ points (some of which may overlap each other) in the persistence diagram $P_s$.

Next, given two persistence diagrams $P$ and $Q$, we need to compute the bottleneck distance $d_B(P, Q)$ between them. In [16], Efrat et al. give an $O(k^{1.5} \log k)$ time algorithm to compute the optimal bijection between two input sets of points $P$ and $Q$ in the plane, with $|P| = |Q| = k$, such that the maximum distance between any mapped pair of points is minimized. This distance is also called the bottleneck distance, and let us denote it by $\hat{d}_B(P, Q)$. The bottleneck distance $d_B(P, Q)$ between two persistence diagrams $P$ and $Q$ is similar to the bottleneck distance $\hat{d}_B$, with the extra addition of the diagonal. However, let $P'$ and $Q'$ denote the vertical projections of the points in $P$ and $Q$, respectively, onto the diagonal $L$. It is easy to show that $d_B(P, Q) = \hat{d}_B(P \cup Q', Q \cup P')$. Hence $d_B(P, Q)$ can be computed by the algorithm of [16] in $O(m^{1.5} \log m)$ time. Finally, to compute the Hausdorff distance between the two sets of persistence diagrams $\hat{\mathcal{C}}$ and $\hat{\mathcal{F}}$, one can check all pairs of persistence diagrams from these two sets, which takes $O(n^2 m^{1.5} \log m)$ time since $|\hat{\mathcal{C}}| \le n$ and $|\hat{\mathcal{F}}| \le n$. The lemma then follows. ∎

By Observation 8, only provides an approximation of with an additive error as decided by the longest edge in the input graphs. For unweighted graphs (where all edges have length 1), this gives an additive error of . As is necessarily an integer in this setting, this in turns provides a factor-2 approximation of the continuous persistence-distortion distance in terms of multiplicative error; see the following corollary.

Corollary 10

The discrete persistence-distortion distance provides a factor-2 (multiplicative) approximation of the continuous persistence-distortion distance for two graphs and with unit edge lengths; that is, .

One may add additional (Steiner) nodes to the edges of the input graphs to reduce the longest edge length, so that the discrete persistence-distortion distance approximates the continuous one within a smaller additive error. But it is not clear how to bound the number of Steiner nodes necessary for approximating the continuous distance within a multiplicative error, even when all edge weights are approximately 1. Indeed, even when all the edges of two input graphs have weights that are roughly 1, it is possible that the persistence-distortion distance is much smaller than 1. Below we show how to directly compute the continuous persistence-distortion distance exactly in polynomial time.

5 Computation of Continuous Persistence-distortion Distance

We now present a polynomial-time algorithm to compute the (continuous) persistence-distortion distance between two metric graphs $(G_1, d_{G_1})$ and $(G_2, d_{G_2})$. As before, set $n = \max\{|V(G_1)|, |V(G_2)|\}$ and $m = \max\{|E(G_1)|, |E(G_2)|\}$. Below we first analyze how points in the persistence diagram $P_s$ change as we move the basepoint $s$ in $G_1$ continuously.

5.1 Changes of persistence diagrams

We first consider the scenario where the basepoint moves within a fixed edge of , and analyze how the corresponding persistence diagram changes. Using notations from Section 2, let be the critical-pair in that gives rise to the persistence point . Then is a maximum for the distance function , while is an up-fork saddle for . We call and from the birth point and death point w.r.t. the persistence-point in the persistence diagram .

As the basepoint moves from $s$ to a point $s'$ within distance $\varepsilon$ along the edge, for any $\varepsilon \ge 0$, the distance function is perturbed by at most $\varepsilon$; that is, $\|f_s - f_{s'}\|_\infty \le \varepsilon$. By the Stability Theorem of persistence diagrams [9], we have that $d_B(P_s, P_{s'}) \le \varepsilon$. Hence as the basepoint moves continuously along the edge, points in the persistence diagram move continuously. There could be new persistence points appearing, or current points disappearing, in the persistence diagram as the basepoint moves. Both creation and deletion necessarily happen on the diagonal of the persistence diagram, as $d_B(P_s, P_{s'})$ necessarily tends to 0 as $s'$ approaches $s$. For simplicity of presentation, for the time being, we describe the movement of persistence points ignoring their creation and deletion. Such creation and deletion will be addressed later in Section 5.1.3.

We now analyze how a specific point may change its trajectory as moves from one endpoint of to the other endpoint .

Specifically, we use the arc-length parameterization of for , that is, . For any object , we use to denote the object w.r.t. basepoint . For example, is the persistence-point w.r.t. basepoint , while and are the corresponding pair of local maximum and up-fork saddle that give rise to . We specifically refer to and as the birth-time function and the death-time function, respectively. By the discussion from the previous paragraphs on stability of persistence diagrams, these two functions are continuous.

(a) (b)
Figure 3: For better illustration of ideas, we use height function defined on a line to show: (a) a max-max critical event at ; and (b) a saddle-saddle critical event at .
Critical events.

To describe the birth-time and death-time functions, we need to understand how the corresponding birth-point and death-point and in change as the basepoint varies. Recall that as moves, the birth-time and death-time change continuously. However, the critical points and in may (i) stay the same or move continuously, or (ii) have discontinuous jumps. Informally, if it is case (i), then we show below that we can describe and using a piecewise linear function with complexity. Case (ii) happens when there is a critical event where two critical-pairs () and swap their pairing partners to and . Specifically, at a critical event, since the birth-time and death-time functions are still continuous, it is necessary that either or ; we call the former a max-max critical event and the latter a saddle-saddle critical event. See Figure 3 for an illustration. It turns out that the birth-time function (resp. death-time function ) is a piecewise linear function whose complexity depends on the number of critical events, which we analyze below.

5.1.1 The death-time function

The analysis of death-time function is simpler than that of the birth-time function; so we describe it first. Observe that is the geodesic distance to the base point . Consequently, merging of two components at an up-fork saddle cannot happen in the interior of an edge, unless at the basepoint itself.

Observation 11

An up-fork saddle $u$ is necessarily a graph node from $V(G_1)$ with degree at least 3, unless $u = s$.

(case-1) (case-2) (c)
Figure 4: (c) Graph of function .

To simplify the exposition, we omit the easier case of in our discussions below. Since the up-fork saddles now can only be graph nodes, as the basepoint moves, the death-point either (case-1) stays at the same graph node, or (case-2) switches to a different up-fork saddle (i.e, a saddle-saddle critical event); see Figure 4.

Now for any point , we introduce the function which is the distance function from to the moving basepoint for ; that is, . Intuitively, as the basepoint moves along , the distance from to a fixed point either increases or decreases at unit speed, until it reaches a point where the shortest path from to changes discontinuously though the shortest path distance still changes continuously. We have the following observation.

Claim 12

For any point , as the basepoint moves in an edge , the distance function defined as is a piecewise linear function with at most 2 pieces, where each piece has slope either ‘1’ or ‘-1’. See Figure 4 (c).

Proof.

Let $u$ and $v$ be the two endpoints of the edge in which the basepoint lies, and write $s$ for the current position of the basepoint. For a fixed point $x$, first consider the shortest path tree with $x$ being the source point (root). If the edge $(u, v)$ is a tree edge in this shortest path tree, then as $s$ moves from $u$ to $v$, the shortest path from $x$ to $s$ changes continuously and the distance $d_{G_1}(x, s)$ increases or decreases at unit speed. In this case, the distance function contains only one linear piece, with slope either ‘1’ (if $s$ is moving away from $x$ along the tree path) or ‘-1’ (if $s$ is moving towards $x$).

Otherwise, the shortest distance to $s$ from $x$ is the shorter of the two routes through the endpoints: the shortest distance from $x$ to $w$ plus the distance from $w$ to $s$ along the edge, for $w = u$ or $w = v$. That is,

$$d_{G_1}(x, s) = \min\{\, d_{G_1}(x, u) + d_{G_1}(u, s),\; d_{G_1}(x, v) + d_{G_1}(v, s) \,\}.$$

As $s$ moves from $u$ to $v$, the two functions in the above equation are linear with slope ‘1’ and ‘-1’, respectively. The graph of the distance function is the lower envelope of the graphs of these two linear functions, and the claim thus follows.

We note that the break point of the distance function, where it changes to a different linear function, happens at the position $s$ such that $d_{G_1}(x, u) + d_{G_1}(u, s) = d_{G_1}(x, v) + d_{G_1}(v, s)$, and it is easy to check that this position is a local maximum of the distance function from $x$. ∎
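
The two-piece structure in the claim can be written out explicitly. The Python sketch below is an illustration only: it assumes the fixed point $x$ does not lie in the interior of the edge, takes as input the distances from $x$ to the two endpoints, and parameterizes the basepoint by its offset `r` from the first endpoint. The name `distance_pieces` and the tuple layout `(start, end, slope, intercept)` are assumptions of this example, with each piece representing the function `slope * r + intercept`.

```python
def distance_pieces(d_u, d_v, length):
    """Linear pieces of r -> distance from x to the basepoint at offset r.

    `d_u`, `d_v` are the distances from x to the edge endpoints u and v, and
    `length` is the edge length. The distance equals
    min(d_u + r, d_v + length - r) for r in [0, length].
    """
    # Offset at which the routes through u and through v have equal length.
    r_star = (length + d_v - d_u) / 2.0
    if r_star >= length:
        return [(0.0, length, 1, d_u)]            # always shortest through u
    if r_star <= 0.0:
        return [(0.0, length, -1, length + d_v)]  # always shortest through v
    return [
        (0.0, r_star, 1, d_u),
        (r_star, length, -1, length + d_v),
    ]


def evaluate(pieces, r):
    """Evaluate the piecewise linear function at offset r."""
    for start, end, slope, intercept in pieces:
        if start <= r <= end:
            return slope * r + intercept
    raise ValueError("offset outside the edge")


pieces = distance_pieces(1.0, 2.0, 3.0)
```

In the example, the distance grows with slope 1 until offset 2, where it reaches its maximum value 3, and then decreases with slope -1.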

As moves, if the death-point stays at the same up-fork saddle , then by the above claim, the death-time function (which locally equals ) is a piecewise linear function with at most 2 pieces.

Now we consider (case-2) when a saddle-saddle critical event happens: Assume that as passes value , switches from a graph node to another one . At the time when this swapping happens, we have that . In other words, the graph for function and the graph for function