
# Interesting Paths in the Mapper

## Abstract

Given a high dimensional point cloud of data with functions defined on the points, the Mapper produces a compact summary in the form of a simplicial complex connecting the points. This summary offers insightful data visualizations, which have been employed in applications to identify subsets of points, i.e., subpopulations, with interesting properties. These subpopulations typically appear as long paths, flares (i.e., branching paths), or loops in the Mapper.

We study the problem of quantifying the interestingness of subpopulations in a given Mapper. First, we create a weighted directed graph $G = (V, E)$ using the 1-skeleton of the Mapper. We use the average values at the vertices (i.e., clusters) of the target function (i.e., a dependent variable) to direct the edges from low to high values. We set the difference between the average values at the vertices (high $-$ low) as the weight of the edge. Covariation of the remaining $h$ functions (i.e., independent variables) is captured by an $h$-bit binary signature assigned to the edge. An interesting path in $G$ is a directed path whose edges all have the same signature. Further, we define the interestingness score of such a path as a sum of its edge weights multiplied by a nonlinear function of their corresponding ranks, i.e., the depths of the edges along the path. The goal is to value the contribution of an edge deep in the path more than that of a similar edge near the start.

Second, we study three optimization problems on this graph to quantify interesting subpopulations. In the problem Max-IP, the goal is to find the most interesting path in $G$, i.e., an interesting path with the maximum interestingness score. We show that Max-IP is NP-complete. For the special case where $G$ is a directed acyclic graph (DAG), which could be a typical setting in many applications, we show that Max-IP can be solved in polynomial time: in $O(mnd_i)$ time and $O(mn)$ space, where $m$, $n$, and $d_i$ are the number of edges, the number of vertices, and the maximum indegree of a vertex in $G$, respectively.

In the more general problem IP, the goal is to find a collection of interesting paths such that these paths form an exact cover of the edge set $E$ (hence they are edge-disjoint) and the overall sum of interestingness scores of all paths is maximized. We also study a variant of IP termed $k$-IP, where the goal is to identify a collection of edge-disjoint interesting paths each with $k$ edges, such that the total interestingness score of all paths is maximized. While $k$-IP can be solved in polynomial time for $k \le 2$, we show $k$-IP is NP-complete for $k \ge 3$. Further, we show that $k$-IP remains NP-complete for $k \ge 3$ even for the case when $G$ is a DAG. We develop heuristics for IP and $k$-IP on DAGs, which use the algorithm for Max-IP on DAGs as a subroutine, and run in and time for IP and $k$-IP, respectively.

## 1 Introduction

Data sets from many applications come in the form of point clouds often in high dimensions along with multiple functions defined on these points. Topological data analysis (TDA) has emerged in the past two decades as a new field whose goal is to summarize such complex data sets, and facilitate the understanding of their underlying topological and geometric structure. We focus on Mapper, a TDA method originally introduced by Singh et al. [18].

Starting from a point cloud $X$ (typically sampled from a metric space), the Mapper studies the topology of the sublevel sets of a filter function $f$ defined on $X$. Starting with a cover of the range of $f$, the Mapper obtains a cover of the domain $X$ by pulling this cover back through $f$. This pullback cover is then refined into a connected cover by splitting each of its elements into various clusters using a clustering algorithm that employs another function $g$ defined on $X$. A compact representation of the data set, also termed Mapper, is obtained by taking the nerve of this connected cover: a simplicial complex with one vertex per cluster, one edge per pair of intersecting clusters, and in general one $p$-simplex per non-empty $(p+1)$-fold intersection. The method can naturally consider multiple filter functions $f_i$, $i = 1, \dots, h$, where the covers of their ranges are jointly pulled back to obtain the cover of $X$. Equivalently, one could consider them together as a single vector-valued filter function $f = (f_1, \dots, f_h)$.
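The construction just described can be sketched in a few lines for the simplest setting. The following is a minimal illustration only, assuming a single real-valued filter `f`, a uniform overlapping interval cover, and a toy one-dimensional gap-based clustering on `g`; the function names, the `overlap` and `gap` parameters, and the clustering rule are hypothetical choices made for this sketch, not part of the Mapper as defined by Singh et al. [18]. Only the 1-skeleton of the nerve is built.

```python
from itertools import combinations

def interval_cover(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with equal-length intervals; neighbours overlap by the given fraction."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]

def cluster_1d(points, values, gap):
    """Toy clustering (an assumption): sort by value, cut where neighbours differ by more than gap."""
    clusters, current = [], []
    for p in sorted(points, key=lambda q: values[q]):
        if current and values[p] - values[current[-1]] > gap:
            clusters.append(current)
            current = []
        current.append(p)
    if current:
        clusters.append(current)
    return clusters

def mapper_1_skeleton(f, g, n_intervals=3, overlap=0.5, gap=1.0, eps=1e-9):
    """Return (vertices, edges): clusters as frozensets of point ids, and pairs of intersecting clusters."""
    cover = interval_cover(min(f), max(f), n_intervals, overlap)
    vertices = []
    for a, b in cover:
        # Pull back one cover element through f, then split it into clusters using g.
        pullback = [p for p in range(len(f)) if a - eps <= f[p] <= b + eps]
        vertices.extend(frozenset(c) for c in cluster_1d(pullback, g, gap))
    # 1-skeleton of the nerve: one edge per pair of clusters sharing at least one point.
    edges = [(i, j) for i, j in combinations(range(len(vertices)), 2)
             if vertices[i] & vertices[j]]
    return vertices, edges

f = [0, 1, 2, 3, 4, 5, 6, 7]
g = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
vertices, edges = mapper_1_skeleton(f, g)
print([sorted(v) for v in vertices], edges)
```

On this toy input the three overlapping intervals yield three clusters, and consecutive clusters share points, so the 1-skeleton is a path on three vertices.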

The Mapper has been used in a growing number of applications from diverse domains, ranging from medicine [9, 12, 14, 15, 16, 17, 19] to basketball player profiles [1] to voting patterns [13]. It is also the main engine in the data analytics software platform of the firm Ayasdi. The key to all these success stories is the ability of Mapper to identify subsets of the point cloud $X$, i.e., subpopulations, that behave distinctly from the rest of the points. In fact, this feature of Mapper distinguishes it from many standard data analysis techniques based on, e.g., machine learning, where the goal is usually to identify patterns valid for the entire data set. We are currently investigating the use of Mapper-based approaches for large scale phenomics data sets (which study how genotypes interact with environment to determine the phenotype) [10]. The problems we study in this paper are directly motivated by applications in phenomics, while also being relevant in general for other types of scientific high-dimensional data. The remarkable subpopulations from $X$ are typically identified as forming long paths, flares with branches, or loops in the Mapper.

Concurrently, several researchers have recently studied various mathematical and foundational aspects of the Mapper. Employing ideas from topological persistence [5], Carrière and Oudot [3] have characterized the structure and stability of the 1-dimensional Mapper. In particular, they proposed a theoretical framework with which one could predict which features will be present in the Mapper given a filter function and its cover, and also predict the (in)stability of each feature. Dey et al. [4] have studied a multiscale mapper, where they employ a tower of covers (as opposed to a single cover) for the codomain of the filter function, whose pullback induces a tower of simplicial complexes. Under certain assumptions, they characterized the stability of the multiscale mapper, and proposed algorithms to compute its persistence diagram. More recently, Carrière et al. [2] considered aspects of statistical analysis and parameter selection for the Mapper. In particular, they showed the existence of a specific set of parameters to construct the Mapper so that the resulting version is an optimal estimate of its continuous analog, thus avoiding the need to test large numbers of candidates in a brute force setting.

The theoretical results outlined above suggest robust ways to build one representative Mapper for any given data set. For instance, one could follow the work of Carrière et al. [2] or identify parameters corresponding to a stable range in the persistence diagram of the multiscale mapper [4]. While a collection of mappers built at multiple scales, e.g., the multiscale mapper [4], could provide a more detailed representation of the data, efficient summarization of the entire representation as well as the extraction of insights relevant to the application still remain challenging. Hence working with a single Mapper could be considered the desirable setting from the point of view of most applications. At the same time, many applications demand more precise quantification of the interesting features in the selected Mapper, as well as to track the corresponding subpopulations as they evolve along such features. In phenomics, for instance, we are interested in identifying which specific varieties (i.e., genotypes) of a crop show resilient growth rates as several environmental factors vary in specific ways during the entire growing season. Each such subpopulation with the associated variation would suggest a testable hypothesis for the practitioner to verify by conducting further experiments. For this purpose, it is also desirable to rank these subpopulations in terms of their “interestingness” to the practitioner.

### 1.1 Our contributions

We propose a framework for quantifying the interestingness of subpopulations in a given Mapper. For the input point cloud $X$, we assume the Mapper is constructed with $h$ filter functions $f_i$, $i = 1, \dots, h$, that represent independent variables, and a target function (i.e., a dependent variable) $g$. The Mapper could be a high-dimensional simplicial complex depending on the choice of covers for the ranges of the filter functions.

First, we create a weighted directed graph $G = (V, E)$ using the 1-skeleton of the Mapper. We use the average values of the target function $g$ at the vertices (i.e., clusters) to direct the edges from low to high values. We set the difference between the average values of $g$ at the vertices (high $-$ low) as the weight of the edge. Covariation of the $h$ filter functions $f_i$ is captured by an $h$-bit binary signature assigned to the edge. We define an interesting path in $G$ as a directed path whose edges all have the same signature. Further, we define the interestingness score of such a path as a sum of its edge weights multiplied by a nonlinear function of their corresponding ranks, i.e., the depths of the edges along the path. The goal is to value the contribution of an edge deep in the path more than that of a similar edge near the start.

Second, we study three optimization problems on this graph to quantify interesting subpopulations. In the problem Max-IP, the goal is to find the most interesting path in $G$, i.e., an interesting path with the maximum interestingness score. We show that Max-IP is NP-complete. For the special case where $G$ is a directed acyclic graph (DAG), which could be a typical setting in applications, we show that Max-IP can be solved in polynomial time: in $O(mnd_i)$ time and $O(mn)$ space, where $m$, $n$, and $d_i$ are the number of edges, the number of vertices, and the maximum indegree of any vertex in $G$, respectively.

In the more general problem IP, the goal is to find a collection of interesting paths such that these paths form an exact cover of the edge set $E$ and the overall sum of interestingness scores of all paths is maximum. The collection of paths identified by IP could include some that are short in terms of number of edges. Hence we also study a variant of IP termed $k$-IP, where the goal is to identify a collection of interesting paths each with $k$ edges for a given number $k$, such that an edge in $E$ is part of at most one such path and the total interestingness score of all paths is maximum. While $k$-IP can be solved in polynomial time for $k \le 2$, we show $k$-IP is NP-complete for $k \ge 3$. Further, we show that $k$-IP remains NP-complete for $k \ge 3$ even for the case when $G$ is a DAG. Finally, we develop heuristics for IP and $k$-IP on DAGs, which use the algorithm for Max-IP on DAGs as a subroutine, and run in and time for IP and $k$-IP, respectively.

### 1.2 Related work

In most previous applications of Mapper [1, 9, 12, 13, 14, 15, 16, 17, 19], interesting subpopulations are characterized by features (paths, flares, loops) identified in a visual manner. As far as we are aware, our work proposes the first approach to rigorously quantify the interesting features, and to rank them in terms of their interestingness. The works of Carrière et al. [2, 3] present a rigorous theoretical framework for 1-dimensional Mapper, where the features are identified as points in an extended persistence diagram. But this line of work does not address the relative importance of the features in the context of the application generating the data. Our work can be considered as a post processing of the Mapper identified by the methods of Carrière et al. While our framework can naturally consider multiple filter functions for a given Mapper, we do not address the stability of the interesting paths identified.

The interesting paths problems we study are related to the class of nonlinear shortest path and minimum cost flow problems previously investigated. Non-additive shortest paths have been studied [20], and the more general minimum concave cost network flow problem has been shown to be NP-complete [8, 21]. At the same time, versions of shortest path or minimum cost flow problems where the contribution of an edge depends nonlinearly on its position or depth in the path appear to have not received much attention. Hence the specific problems we study should be of independent interest as a new class of nonlinear longest path (equivalently, shortest path) problems.

## 2 Methods

We refer the reader to the original paper by Singh et al. [18] for background on Mapper, and to other recent work [2, 3, 4] for related constructions. For our purposes, we start with a single Mapper $M$ that is a possibly high-dimensional simplicial complex constructed from a point cloud $X$ using $h$ continuous filter functions $f_i : X \to \mathbb{R}$ for $i = 1, \dots, h$ and another continuous function $g : X \to \mathbb{R}$. In the setting of a typical application, $g$ could represent a dependent variable whose relationship with the independent variables represented by the $f_i$ is of interest. We assume $g$ is used for clustering within the Mapper framework.

### 2.1 Interesting paths and their interestingness scores

Each vertex in the Mapper $M$ represents a cluster of points from $X$ that have similar values of the function $g$, the dependent variable. An edge in $M$ connects two such clusters containing a common subset of points, i.e., a subpopulation. By definition, each edge in $M$ connects clusters belonging to distinct elements of the pullback cover of $X$, and hence the corresponding values of the filter functions $f_i$ also change when moving along the edge. Therefore, by following a trail of vertices whose average $g$ values are monotonically varying, we can capture subpopulations that gradually or abruptly alter their behavior as measured by $g$ under continuously changing filter intervals. In phenomics, for instance, we seek to identify the subset of varieties of the crop that show increasing growth rate ($g$) as the temperature ($f_1$) increases and humidity ($f_2$) decreases. Identifying such subpopulations is the basis for formulating hypotheses on the dependence of growth rate on the environmental factors, which the practitioners could test using rigorous experiments. We formulate the problem of identifying such subpopulations as that of finding interesting edge-disjoint paths in a directed graph.

We construct a weighted directed graph $G = (V, E)$ which represents the 1-skeleton of the Mapper $M$ along with some additional information. We set $V$ as the set of vertices (0-simplices) of $M$, and $E$ as the set of 1-simplices of $M$. We assign directions and weights to the edges as follows. Each vertex $u \in V$ denotes a subset of points from $X$ that constitute a partial cluster. We denote this subset as $X(u)$. We let $g(u)$ and $f_i(u)$ denote the average values of the clustering function $g$ (dependent variable) and the filter function $f_i$, respectively, for all points in $X(u)$:

$$ g(u) = \frac{\sum_{x \in X(u)} g(x)}{|X(u)|} \quad \text{and} \quad f_i(u) = \frac{\sum_{x \in X(u)} f_i(x)}{|X(u)|}, \quad i = 1, \dots, h. $$

For an edge $e = (u, v)$ in $E$, we assign as its weight the absolute difference between the average values of the clustering function at the two vertices: $\omega(e) = |g(u) - g(v)|$. Notice $\omega(e) \ge 0$ for all edges in $E$. In addition, the edge is directed from the vertex with the lower average $g$ value to the vertex with the higher one: if $g(u) < g(v)$ then $e = (u, v)$, and $e = (v, u)$ otherwise. We let $n$ and $m$ denote the numbers of vertices and edges in $G$, respectively.

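We assign an $h$-bit binary signature $\mathrm{Sig}(e) \in \{0, 1\}^h$ to each oriented edge $e = (u, v)$ to capture the covariation of $g$ and the filter functions $f_i$. We set the $i$th bit $\mathrm{Sig}_i(e) = 1$ if $f_i(u) \le f_i(v)$, and $\mathrm{Sig}_i(e) = 0$ otherwise.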
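The construction of the weighted directed graph can be sketched as follows. This is a minimal illustration under stated assumptions: clusters are given as lists of point indices, the 1-simplices as index pairs, and each edge is returned as a tuple `(tail, head, weight, signature)`. The function name and data layout are hypothetical, and the tie-handling in the signature bit (a bit is 1 when the averaged filter value does not decrease along the oriented edge) is an assumption of this sketch.

```python
from statistics import mean

def build_interesting_graph(clusters, edges, g, filters):
    """clusters: list of lists of point ids (one per Mapper vertex);
    edges: pairs (a, b) of cluster indices (the 1-simplices);
    g: target value per point; filters: list of h lists of filter values per point."""
    g_avg = [mean(g[x] for x in c) for c in clusters]
    f_avg = [[mean(f[x] for x in c) for c in clusters] for f in filters]
    directed = []
    for a, b in edges:
        # Orient from the lower to the higher average value of g.
        u, v = (a, b) if g_avg[a] < g_avg[b] else (b, a)
        weight = abs(g_avg[v] - g_avg[u])
        # One bit per filter function: 1 if its average does not decrease along the edge.
        signature = tuple(1 if f[v] >= f[u] else 0 for f in f_avg)
        directed.append((u, v, weight, signature))
    return directed

clusters = [[0, 1], [1, 2], [2, 3]]
growth = [1.0, 2.0, 3.0, 4.0]         # g: dependent variable
temperature = [10, 20, 30, 40]        # f_1
humidity = [9, 7, 5, 3]               # f_2
print(build_interesting_graph(clusters, [(0, 1), (1, 2)], growth, [temperature, humidity]))
```

In this toy example both edges receive the signature `(1, 0)`: growth rate rises together with temperature while humidity falls, which is the kind of covariation the phenomics example in the text looks for.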

###### Definition 2.1.

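An interesting $k$-path for a given $k$ with $1 \le k \le n-1$ is a directed path $P = (e_{i_1}, \dots, e_{i_k})$ of $k$ edges in $G$, such that the signature $\mathrm{Sig}(e_{i_r})$ is identical for all $r = 1, \dots, k$. An interesting path is a path of arbitrary length in the interval $[1, n-1]$.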

###### Definition 2.2.

Given an interesting $k$-path $P = (e_{i_1}, \dots, e_{i_k})$ in $G$ as specified in Definition 2.1, we define its interestingness score $I(P)$ as follows.

$$ I(P) = \sum_{r=1}^{k} \omega(e_{i_r}) \times \log(1 + r) \tag{1} $$

In particular, the contribution of an edge $e_{i_r}$ to $I(P)$ is defined as $\omega(e_{i_r}) \times \log(1 + r)$, where $r$ is the rank or order of edge $e_{i_r}$ as it appears in $P$.
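As a small sanity check of Equation (1), the score can be computed directly from the edge weights listed in path order. The sketch below assumes natural logarithms (the text does not fix the base) and uses a hypothetical function name.

```python
import math

def interestingness(weights):
    """Sum of w_r * log(1 + r) over the edge weights of a path, listed in path order (r = 1, 2, ...)."""
    return sum(w * math.log(1 + r) for r, w in enumerate(weights, start=1))

# The same two weights score differently depending on their order along the path:
print(interestingness([1.0, 3.0]))  # heavy edge deep in the path
print(interestingness([3.0, 1.0]))  # heavy edge at the start
```

The first call returns the larger value, reflecting the stated goal of valuing an edge deep in the path more than a similar edge at the start.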

###### Remark 2.3.

Intuitively, we use the rank of an edge as an inflation factor for its weight—the later an edge appears in the path, the more its weight will count toward the interestingness of the path. This logic incentivizes the growth of long paths. The log function, on the other hand, helps temper this growth in terms of number of edges. One could use, for instance, the rank of the edge in place of the log function.
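To make the tempering effect concrete, consider a path of $k$ edges that all have unit weight (a simplifying assumption made only for this illustration). The logarithmic factor and the plain rank then give

$$ \sum_{r=1}^{k} \log(1 + r) = \log\big((k+1)!\big) \qquad \text{versus} \qquad \sum_{r=1}^{k} r = \frac{k(k+1)}{2}. $$

For $k = 3$ and natural logarithms these are $\log 24 \approx 3.18$ and $6$, respectively. The logarithmic score grows roughly like $k \log k$, while the rank-based score grows quadratically in $k$.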

###### Remark 2.4.

The above framework is modified easily to characterize robust interesting paths, where the signature matching condition is relaxed such that as long as , for instance.

###### Remark 2.5.

While we assume $f_i$ and $g$ are continuous functions from $X$ to $\mathbb{R}$, our framework could handle more general functions as well. If some $f_i$ is a vector-valued function, for instance, we could first compute pairwise distances of the points in $X$ using $f_i$, and then assign to each point in $X$ its average distance to all other points as a “surrogate” function.

###### Remark 2.6.

In our work on phenomics [10], we used $g$ by itself for clustering when constructing the Mapper. At the same time, the interesting paths framework can handle without any modification the cases where $g$ is used along with other functions to cluster. In fact, $g$ could also be used as a filter function along with the $f_i$ as long as it is used for clustering.

We now study optimization problems whose goal is to identify interesting path(s) that maximize interestingness score(s).

• Max-IP: Find an interesting path $P$ in $G$ such that $I(P)$ is maximized.

• IP: Find a collection $\mathcal{P}$ of interesting paths in $G$ that form an exact cover of the edges in $E$ (i.e., each $e \in E$ is part of exactly one $P \in \mathcal{P}$), and the total interestingness score $I(\mathcal{P}) = \sum_{P \in \mathcal{P}} I(P)$ is maximized.

• $k$-IP: For a given $k$ between $1$ and $n-1$, find a collection $\mathcal{P}$ of interesting $k$-paths such that each $e \in E$ is part of at most one $P \in \mathcal{P}$, and the total interestingness score $I(\mathcal{P}) = \sum_{P \in \mathcal{P}} I(P)$ is maximized.

###### Remark 2.7.

Both IP and $k$-IP produce edge-disjoint collections of interesting paths. In the problem IP, every edge in $E$ is part of an interesting path in the collection $\mathcal{P}$. But this setting might include several short (in number of edges) interesting paths. In $k$-IP, each interesting path found has exactly $k$ edges, and some edges in $E$ might not be part of any interesting $k$-path in $\mathcal{P}$. Hence $k$-IP might identify more meaningful (i.e., nontrivial) subpopulations overall.

###### Remark 2.8.

The $\log(1 + r)$ factor in the interestingness score in Equation (1) makes each of the above optimization problems nonlinear. At the same time, the type of nonlinearity introduced here is distinct from the ones studied in the literature, e.g., in non-additive shortest paths [20], or in minimum concave cost flow [8, 21]. Hence these problems form a new class of nonlinear longest (equivalently, shortest) path problems, which would be of interest independent of their application in the context of the Mapper and TDA.

## 3 The Max-IP Problem

The goal of Max-IP is to identify an interesting path with the maximum interestingness score. We show Max-IP is NP-complete, but is in P on directed acyclic graphs.

### 3.1 Max-IP on directed graphs

In the decision version of Max-IP termed Max-IPD, we are given a directed graph $G = (V, E)$ with edge weights $\omega(e) \ge 0$ and signatures $\mathrm{Sig}(e)$ for $e \in E$ and a target score $s$. The goal is to determine if there exists an interesting path $P$ in $G$ whose interestingness score $I(P) = s$.

###### Lemma 3.1.

Max-IPD on directed graphs is NP-complete.

###### Proof.

Given an interesting path $P$ in $G$, we can verify that the signatures of all its edges are identical and compute its interestingness score using Equation (1) to compare with the target score $s$ in polynomial time. Hence Max-IPD is in NP.

We reduce the problem of checking if a directed graph has a directed Hamiltonian cycle (DirHC) to Max-IPD. DirHC is one of the 21 NP-complete problems originally introduced by Karp [11]. Given an instance $G = (V, E)$ of DirHC with $|V| = n$, we construct an instance of Max-IPD on a directed graph $G' = (V', E')$ as follows. We replace an arbitrary vertex $u \in V$ by two vertices $u'$ and $u''$, i.e., $V' = (V \setminus \{u\}) \cup \{u', u''\}$ and $|V'| = n + 1$. Each edge $(u, v) \in E$ is replaced by $(u', v)$ in $E'$ and each edge $(w, u) \in E$ is replaced by $(w, u'')$ in $E'$. All other edges in $E$ are included in $E'$ without changes. All edges in $E'$ are assigned unit weights and identical signatures, and we set the target score $s = \sum_{r=1}^{n} \log(1 + r)$.

We claim that $G$ has a directed Hamiltonian cycle if and only if there exists an interesting path in $G'$ with interestingness score $s = \sum_{r=1}^{n} \log(1 + r)$. Let $G$ have a directed Hamiltonian cycle $C$. Then $C$ must have $n$ edges by definition, and $C$ visits (i.e., enters and leaves) each vertex in $V$ exactly once. Hence there must exist edges $(u, v)$ and $(w, u)$ in $C$, where $u$ is the vertex that was split into $u'$ and $u''$. We construct the interesting path $P$ in $G'$ using $(u', v)$, $(w, u'')$, and the remaining $n - 2$ edges in $C$. Thus $P$ is a directed path in $G'$ with $n$ edges. Further, since all edges in $G'$ have unit weights and identical signatures, it is clear from Equation (1) that $P$ is indeed an interesting path in $G'$ with $I(P) = s$.

Conversely, let $P$ be an interesting path in $G'$ with $I(P) = s$. Since $P$ is an interesting path, it visits (i.e., enters and/or leaves) any vertex in $V'$ at most once. Since all edges in $G'$ have unit weights and identical signatures, and by the definition of interestingness score in Equation (1), it is clear that $P$ must have $n$ edges. Hence $P$ visits all $n + 1$ vertices of $G'$. Since the copy $u'$ of the split vertex $u$ has no incoming edges and the copy $u''$ has no outgoing edges, $P$ must start with an edge $(u', v)$ and end with an edge $(w, u'')$. Then the directed cycle in $G$ defined by the edges $(u, v)$, $(w, u)$, and the remaining $n - 2$ edges in $P$ is Hamiltonian. Hence Max-IPD on directed graphs is NP-complete. ∎
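The vertex-splitting step of this reduction is easy to check on toy instances. The sketch below is an illustration only: it assumes vertices are labeled `0..n-1`, keeps the label of the split vertex for the copy that retains the outgoing edges, uses the new label `n` for the copy that receives the incoming edges, and tests for a path that attains the target score by exhaustive search (exponential time, suitable only for tiny graphs). All function names are hypothetical, and natural logarithms are assumed.

```python
import math
from itertools import permutations

def split_vertex(n, edges, u=0):
    """Redirect every edge entering u to a new vertex n; edges leaving u are unchanged."""
    return [(a, n if b == u else b) for a, b in edges]

def target_score(n):
    """Score of a path with n unit-weight edges under Equation (1)."""
    return sum(math.log(1 + r) for r in range(1, n + 1))

def has_path_with_score(n_vertices, edges, target):
    """Brute force over simple directed paths; all weights are 1 and all signatures identical."""
    edge_set = set(edges)
    for k in range(1, n_vertices):  # k = number of edges on the path
        score = sum(math.log(1 + r) for r in range(1, k + 1))
        if not math.isclose(score, target):
            continue
        for order in permutations(range(n_vertices), k + 1):
            if all((a, b) in edge_set for a, b in zip(order, order[1:])):
                return True
    return False

cycle = [(0, 1), (1, 2), (2, 0)]       # has a Hamiltonian cycle
no_cycle = [(0, 1), (1, 0), (1, 2)]    # does not
print(has_path_with_score(4, split_vertex(3, cycle), target_score(3)))
print(has_path_with_score(4, split_vertex(3, no_cycle), target_score(3)))
```

The first instance yields a path with three unit-weight edges in the split graph, while the second has no path longer than two edges and therefore cannot reach the target.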

### 3.2 Max-IP on directed acyclic graphs

###### Lemma 3.2.

Max-IP on a directed acyclic graph is in P.

###### Proof.

We present a polynomial time algorithm for Max-IP on a DAG (as proof of Lemma 3.2). The input is a DAG $G = (V, E)$ with $n$ vertices and $m$ edges, with edge weights $\omega(e) \ge 0$ and signatures $\mathrm{Sig}(e)$ for all $e \in E$. The output is an interesting path $P^*$ which has the maximum interestingness score in $G$. We use dynamic programming, with the forward phase computing $I(P^*)$ and the backtracking procedure reconstructing a corresponding $P^*$.

Let $T(i, j)$ denote the score of a maximum interesting path of length $j$ edges ending at edge $e_i$ for $i = 1, \dots, m$. Since an interesting path could be of length at most $n - 1$, we have $1 \le j \le n - 1$. Therefore the values in the recurrence can be maintained in a 2-dimensional table $T$ of size $m \times (n - 1)$ (see Figure 1 for an illustration). The algorithm has three steps:

• Initialization: We initialize the first column of the table as follows.

$$ T(i, 1) = \omega(e_i) \times \log(2) \quad \text{for } i = 1, \dots, m \ (\forall e_i \in E). $$
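• Recurrence: For an edge $e_i = (u, v)$, we define a predecessor edge of $e_i$ as any edge $e_{i'} \in E$ of the form $e_{i'} = (w, u)$ and $\mathrm{Sig}(e_{i'}) = \mathrm{Sig}(e_i)$. Let $\mathrm{Pred}(e_i)$ denote the set of all predecessor edges of $e_i$. We define the recurrence for $T(i, j)$ as follows.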

$$ T(i, j) = \max_{e_{i'} \in \mathrm{Pred}(e_i)} \left\{ T(i', j - 1) + \omega(e_i) \times \log(1 + j) \right\}, \quad j = 2, \dots, n - 1. \tag{2} $$
• Output: Let $T(i^*, j^*) = \max_{i, j} T(i, j)$. Then the maximum interestingness score for Max-IP is $T(i^*, j^*)$. We obtain an optimal path $P^*$ with $I(P^*) = T(i^*, j^*)$ by backtracking from $T(i^*, j^*)$.
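The three steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: edges are assumed to be tuples `(tail, head, weight, signature)` on vertices `0..n-1`, natural logarithms are assumed, and table cells for which no interesting path of the given length exists are represented by negative infinity, a convention adopted here that the text leaves implicit.

```python
import math

def max_interesting_path(n, edges):
    """Dense-table dynamic program for Max-IP on a DAG with n vertices.
    Returns (best score, list of edge indices along one optimal path)."""
    m = len(edges)
    # Group edges by (head vertex, signature) so predecessors can be looked up directly.
    into = {}
    for i, (u, v, w, sig) in enumerate(edges):
        into.setdefault((v, sig), []).append(i)
    pred = [into.get((u, sig), []) for (u, v, w, sig) in edges]

    neg_inf = float("-inf")
    T = [[neg_inf] * n for _ in range(m)]      # columns 1..n-1 are used; column 0 is unused
    back = [[None] * n for _ in range(m)]
    for i, e in enumerate(edges):              # initialization: every edge alone is a path
        T[i][1] = e[2] * math.log(2)
    for j in range(2, n):                      # recurrence, one column at a time
        for i, e in enumerate(edges):
            for p in pred[i]:
                cand = T[p][j - 1] + e[2] * math.log(1 + j)
                if cand > T[i][j]:
                    T[i][j], back[i][j] = cand, p

    best, i, j = max((T[i][j], i, j) for i in range(m) for j in range(1, n))
    path = []
    while i is not None:                       # backtracking
        path.append(i)
        i, j = back[i][j], j - 1
    return best, path[::-1]

edges = [(0, 1, 1.0, (1,)), (1, 2, 2.0, (1,)), (2, 3, 3.0, (0,)), (1, 3, 0.5, (1,))]
print(max_interesting_path(4, edges))
```

In the example, the edge from vertex 2 to vertex 3 carries a different signature, so the best path consists of the first two edges even though a longer directed path exists in the graph.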

Proof of correctness: Any interesting path in $G$ can be at most $n - 1$ edges long. As a particular edge could appear anywhere along such a path, its rank ranges from $1$ to $n - 1$. Hence the recurrence table $T$ (see Figure 1) is sufficient to capture all possibilities for each edge in $E$. We make the following observation about the structure of maximum interesting paths identified by the dynamic programming algorithm, which guarantees its correctness.

###### Lemma 3.3.

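Let $P = (e_{i_1}, \dots, e_{i_j})$ be a maximum interesting path of length $j$ ending at edge $e_{i_j}$. Then $(e_{i_1}, \dots, e_{i_{j'}})$ is a maximum interesting path of length $j'$ ending at edge $e_{i_{j'}}$ for each $j' < j$.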

###### Proof.

The edge $e_i$ can appear as the rank $j$ edge in an interesting path only if there exists another interesting path of length $j - 1$ ending at one of its predecessor edges. The graph $G$ is a DAG, and we select a maximum scoring path among such predecessor paths for extension in the recurrence in Equation (2). Since all edge weights are nonnegative, the interestingness score of any such path computed by Equation (1) is a nondecreasing function of its number of edges. Hence the optimality of $(e_{i_1}, \dots, e_{i_{j'}})$ as an interesting path ending at $e_{i_{j'}}$ is guaranteed for each $j' < j$. ∎

Complexity analysis: The above dynamic programming algorithm can be implemented to run in $O(mn)$ space and a worst-case time complexity of $O(mnd_i)$, where $d_i$ denotes the maximum indegree of any vertex in $G$. We can initialize the table $T$ containing $m$ rows and $n - 1$ columns, and compute the values in the table one column at a time, starting from the first column to the last column. Since the maximum number of predecessors of an edge is bounded by $d_i$, the cost of computing each cell in the table is $O(d_i)$. Therefore, the overall runtime complexity is $O(mnd_i)$. Since $d_i < n$ in any directed graph, we get a worst-case time complexity of $O(mn^2)$, showing that Max-IP on a DAG is in P. At the same time, $d_i$ could be much smaller than $n$ in specific cases. In the case where $d_i$ is a constant, the algorithm runs in $O(mn)$ time. ∎

#### Algorithmic improvements

The dynamic programming algorithm for Max-IP on DAGs can be implemented to run in space and time smaller in practice than the worst-case limits suggested above. First, we note that computing the full table $T$ is likely to be wasteful, as it is likely to be sparse in practice. The sparsity of $T$ follows from the observation that an interesting path of length $j$ ending at edge $e_i$ can exist only if there exists at least one other interesting path of length $j - 1$ ending at one of $e_i$’s predecessor edges. We can exploit this property by designing an iterative implementation as follows.

Instead of storing the entire table $T$, we store only the rows (edges), and introduce columns on a “need basis” by maintaining a dynamic list of column indices for each edge $e_i$.

1. Initially, we assign the list $\{1\}$ to every edge, as each edge is guaranteed to be in an interesting path of length at least $1$ (the path consisting of the edge by itself).

2. In general, the algorithm performs multiple iterations, within each of which we visit and update the dynamic lists for all edges in $E$ as follows. For every edge $e_i$, we add the index $j + 1$ to its list for every index $j$ that appears in the list of one of its predecessor edges. The algorithm iterates until there is no further change in the list values for any of the edges.

The number of iterations in the above implementation can be bounded by the length of the longest path in the DAG, which is less than $n$. Also, we implement the list update from predecessors to successors such that each edge is visited only a constant number of times (despite the varying in-degree to out-degree products at different vertices). To this end, we implement the update in Step 2 as a two-step process at each vertex $u$: first, we perform a union of all lists from the predecessor edges of the form $(w, u)$, so that the merged list can then be used to update the lists of all the successor edges of the form $(u, v)$. Thus the work in each iteration is bounded by .

Taken together, even in the worst-case where there are iterations, the overall time to construct these dynamic lists is . Furthermore, during the list construction process, if one were to carefully store the predecessor locations using pointers, then the computation of the recurrence in each cell can be executed in time proportional to the number of non-empty predecessor values in the table. Overall, this revised algorithm can be implemented to run in time where is the diameter (length of the longest path) of the DAG, and in space proportional to the number of non-zero values in the matrix.

Further, the above iterative implementation is also inherently parallel since the list value at an edge in the current iteration depends only on the list values of its predecessors from the previous iteration. Therefore, we can implement the algorithm in a parallel fashion, further enhancing its efficiency in practice.
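One way to realize the idea of storing only the needed columns is sketched below. It is a variant written for illustration, not the exact list-merging procedure described above: each edge keeps a dictionary from path length to `(score, predecessor)`, and iteration `j` extends only the edges that gained a cell for length `j` in the previous iteration. Edge tuples `(tail, head, weight, signature)`, natural logarithms, and all names are assumptions of this sketch.

```python
import math

def max_interesting_path_sparse(edges):
    """Sparse, level-by-level version of the Max-IP recurrence on a DAG.
    Returns (best score, list of edge indices along one optimal path)."""
    into = {}
    for i, (u, v, w, sig) in enumerate(edges):
        into.setdefault((v, sig), []).append(i)
    succ = {}                                   # predecessor edge index -> successor edge indices
    for i, (u, v, w, sig) in enumerate(edges):
        for p in into.get((u, sig), []):
            succ.setdefault(p, []).append(i)

    # cells[i][j] = (score, predecessor) exists only if some interesting path of
    # length j ends at edge i; every edge starts with the single column j = 1.
    cells = [{1: (e[2] * math.log(2), None)} for e in edges]
    frontier, j = set(range(len(edges))), 1
    while frontier:
        gained = set()
        for p in frontier:
            base = cells[p][j][0]
            for i in succ.get(p, []):
                cand = base + edges[i][2] * math.log(1 + (j + 1))
                if j + 1 not in cells[i] or cand > cells[i][j + 1][0]:
                    cells[i][j + 1] = (cand, p)
                    gained.add(i)
        frontier, j = gained, j + 1

    best, i, j = max((sc, i, j) for i, row in enumerate(cells) for j, (sc, _) in row.items())
    path = []
    while i is not None:
        path.append(i)
        i, j = cells[i][j][1], j - 1
    return best, path[::-1]

edges = [(0, 1, 1.0, (1,)), (1, 2, 2.0, (1,)), (2, 3, 3.0, (0,)), (1, 3, 0.5, (1,))]
print(max_interesting_path_sparse(edges))
```

Since the cells for length `j + 1` depend only on the cells for length `j`, the inner loop over the frontier is the part that could be distributed across workers, in line with the parallelism noted above.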

## 4 The $k$-IP Problem

The goal of $k$-IP for a given $k$ with $1 \le k \le n - 1$ is to find a set of edge-disjoint interesting $k$-paths such that the sum of their interestingness scores is maximized. We show that $k$-IP on directed graphs can be solved in polynomial time for $k \le 2$. On the other hand, we show that $k$-IP is NP-complete for $k \ge 3$. Further, we show that $k$-IP remains NP-complete for $k \ge 3$ even over a DAG.

### 4.1 $k$-IP on directed graphs

The smallest value of $k$ for which $k$-IP is nontrivial is $k = 2$. We can solve 2-IP as a weighted matching problem.

###### Lemma 4.1.

$k$-IPD on directed graphs is in P for $k \le 2$.

###### Proof.

The case of $k = 1$ turns out to be trivial. An optimal solution for 1-IP is obtained by taking $\mathcal{P}$ as a collection of interesting 1-paths each comprised of a single edge. These 1-paths are edge-disjoint by definition, and the need to compare signatures within a path does not arise. Since all edge weights $\omega(e) \ge 0$, the total interestingness score is guaranteed to be maximum. This optimal solution is unique when $\omega(e) > 0$ for all edges $e \in E$.

We model the 2-IP problem ($k = 2$) as an equivalent weighted matching problem on an undirected graph $G' = (V', E')$, which we construct as follows. We include a vertex in $V'$ for each edge $e \in E$ in the input graph. Hence $|V'| = m$. Whenever edges $e_i, e_j \in E$ form an interesting 2-path $(e_i, e_j)$ in $G$, we add the undirected edge $\{e_i, e_j\}$ to $E'$ with its weight computed using Equation (1). If both interesting paths $(e_i, e_j)$ and $(e_j, e_i)$ can be formed by a pair of edges $e_i, e_j$, we set the weight of $\{e_i, e_j\}$ to the larger of the two scores. Notice that for all edges , and . A matching in $G'$ corresponds to a set of edge-disjoint interesting 2-paths in $G$: a vertex $e_i \in V'$ will be matched with at most one other vertex $e_j \in V'$, and such a match of vertices in $G'$ corresponds to the interesting path $(e_i, e_j)$ (or $(e_j, e_i)$, but not both). It follows that a maximum weight matching in $G'$ corresponds to an optimal solution to 2-IP on the input graph $G$.

The maximum weighted matching problem on an undirected graph with vertices and edges can be solved in strongly polynomial time; e.g., Gabow’s implementation [7] of Edmonds’ algorithm [6] runs in time. As such, we can solve 2-IP by solving the weighted matching problem on the associated graph $G'$ in time. Hence $k$-IP is in P for $k \le 2$. ∎
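The matching formulation for paths of two edges can be illustrated on a toy instance. In the sketch below, the auxiliary graph is stored as a dictionary from pairs of edge indices to weights, and the matching is found by exhaustive search purely as a stand-in for Edmonds' algorithm, so it is only suitable for a handful of edges. Edge tuples `(tail, head, weight, signature)`, natural logarithms, taking the larger score when both orders are possible, and all function names are assumptions of this sketch.

```python
import math
from itertools import combinations

def two_path_score(e1, e2):
    """Score of the 2-path (e1, e2) by Equation (1), or None if it is not an interesting 2-path."""
    (u1, v1, w1, s1), (u2, v2, w2, s2) = e1, e2
    if v1 == u2 and s1 == s2:
        return w1 * math.log(2) + w2 * math.log(3)
    return None

def matching_graph(edges):
    """Auxiliary undirected graph: one vertex per input edge, one weighted edge per interesting 2-path."""
    aux = {}
    for i, j in combinations(range(len(edges)), 2):
        scores = [s for s in (two_path_score(edges[i], edges[j]),
                              two_path_score(edges[j], edges[i])) if s is not None]
        if scores:
            aux[(i, j)] = max(scores)
    return aux

def max_weight_matching_bruteforce(aux):
    """Exhaustive search over matchings; returns (total weight, chosen pairs)."""
    items = list(aux.items())

    def best(k, used):
        if k == len(items):
            return 0.0, []
        skip = best(k + 1, used)
        (i, j), w = items[k]
        if i in used or j in used:
            return skip
        value, chosen = best(k + 1, used | {i, j})
        take = (value + w, [(i, j)] + chosen)
        return max(skip, take, key=lambda t: t[0])

    return best(0, frozenset())

edges = [(0, 1, 1.0, (1,)), (1, 2, 5.0, (1,)), (2, 3, 1.0, (1,))]
print(max_weight_matching_bruteforce(matching_graph(edges)))
```

Here the middle edge can pair with either neighbour but not both; the matching keeps the pairing in which the heavy edge has rank 2, since that yields the larger score.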

We now consider $k$-IP for $k \ge 3$. To characterize its complexity, we study the decision version of $k$-IP termed $k$-IPD, in which we are given a directed graph $G = (V, E)$ with edge weights $\omega(e) \ge 0$ and signatures $\mathrm{Sig}(e)$ for $e \in E$ and a target score $s$. The goal is to determine if there exists a collection $\mathcal{P}$ of edge-disjoint interesting $k$-paths in $G$ whose total interestingness score $I(\mathcal{P}) = s$.

###### Theorem 4.2.

$k$-IPD on directed graphs is NP-complete for $k \ge 3$.

###### Proof.

Given a collection of interesting $k$-paths in a directed graph $G$, we can verify that they are edge-disjoint, that each path has $k$ edges, and that signatures are identical for all edges in each path, all in polynomial time. We can compute the interestingness score of each $k$-path using Equation (1) also in polynomial time, and add the values to compare with the target score to check for equality. Hence $k$-IPD is in NP.
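The membership argument amounts to a certificate checker, which can be sketched as below. Edge tuples `(tail, head, weight, signature)`, natural logarithms, and the function name are assumptions of this sketch, and equality with the target is tested up to floating-point tolerance with `math.isclose`, a practical substitute for exact arithmetic.

```python
import math

def verify_k_ipd(paths, edge_set, k, target):
    """Check a proposed collection of paths against the k-IPD conditions in polynomial time."""
    used = set()
    total = 0.0
    for path in paths:
        if len(path) != k or any(e not in edge_set for e in path):
            return False                      # wrong length or an edge not in the graph
        if any(a[1] != b[0] for a, b in zip(path, path[1:])):
            return False                      # consecutive edges must chain head to tail
        vertices = [path[0][0]] + [e[1] for e in path]
        if len(set(vertices)) != len(vertices):
            return False                      # a path may not revisit a vertex
        if len({e[3] for e in path}) != 1:
            return False                      # all signatures on a path must agree
        if any(e in used for e in path):
            return False                      # paths must be edge-disjoint
        used.update(path)
        total += sum(e[2] * math.log(1 + r) for r, e in enumerate(path, start=1))
    return math.isclose(total, target)

e1, e2, e3 = (0, 1, 1.0, (1,)), (1, 2, 2.0, (1,)), (2, 3, 1.0, (0,))
print(verify_k_ipd([[e1, e2]], {e1, e2, e3}, 2, math.log(2) + 2 * math.log(3)))
```

Each check is a single pass over the edges of the certificate, so the total work is linear in the size of the collection apart from hashing costs.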

We now reduce the exact 3-cover problem (X3C) to 3-IPD. We then show a similar reduction for $k \ge 3$ as well, proving $k$-IPD is NP-complete for $k \ge 3$. The latter case for general $k$ subsumes the case for $k = 3$. We still present the details for $k = 3$ separately, as this case reveals the structure of the general reduction in an arguably simpler setting. X3C is a version of one of the 21 NP-complete problems originally introduced by Karp [11], and is defined as follows. Given a set $X$ with $3q$ elements and a collection $\mathcal{C}$ of 3-element subsets of $X$ with , determine if there exists a subset $\mathcal{C}' \subseteq \mathcal{C}$ such that each element of $X$ belongs to exactly one member of $\mathcal{C}'$. Notice that such an exact cover must necessarily have exactly $q$ members. Also, we assume (else the instance will be trivial).

Given an instance of XC, we create a directed graph for an instance of -IPD as follows. Each element corresponds to a unique directed edge in . Corresponding to each -element set , we add to a directed graph object as shown in Figure 2. The edges corresponding to all are assigned the large weight making them the “heavy” edges, while the rest of the edges are all assigned unit weights. Further, we assume is identical for all edges . The three “V”-shaped -paths in the top of Figure 2 are referred to as the -, -, and -paths. Notice that by this construction, can have at most vertices and edges.

Let , and . From a graph object as shown above, we observe that edge-disjoint interesting -paths can be chosen by -IPD each with interestingness score if and only if the -, -, and -paths are chosen along with other -paths as shown in Figure 3. Further, each edge corresponding to an element in may belong to only one -path. Thus, at most such graph objects may contribute the score of to the total interestingness score. The remaining graph objects may contribute a score of at most corresponding to the selection of the edge-disjoint interesting -paths shown in Figure 3, which avoid the edges corresponding to any . If such graph objects do contribute each to the total interestingness score, it is clear that the corresponding triplet elements in form an exact -cover of . Further, -IPD on will identify exactly edge-disjoint interesting -paths with a total interestingness score of exactly .

Conversely, if XC has an exact -cover , we choose the edge-disjoint interesting -paths (recall we assume identical signatures for all edges in ) each with interestingness score as described above in the graph object in corresponding to each of the -element sets in . For the -element sets in , we choose the interesting -paths each with interestingness score in the corresponding graph objects in . This collection of edge-disjoint interesting -paths in will have a total interestingness score of exactly .

Thus has an exact -cover if and only if -IPD on has a target total interestingness score of , proving -IPD is NP-complete.

We now extend this result to -IPD for . To this end, we reduce the exact -cover problem (XC) to -IPD for general . The XC problem is a generalization of XC, and is defined as follows. Given a set with elements and a collection of -element subsets of with , determine if there exists a subset such that each element of belongs to exactly one member of . Notice that such an exact cover must necessarily have exactly members. Also, we assume (else the instance will be trivial).

Given an instance of XC, we create a directed graph for an instance of -IPD as follows. Each element corresponds to a unique directed edge in . For each -element set , we add to a corresponding directed graph object as shown in Figure 4. The edges corresponding to all are assigned the large weight (giving “heavy” edges), while the rest of the edges are all assigned unit weights. Further, we assume is identical for all edges . The “V”-shaped -paths in the top of Figure 4 are referred to as the -, -,-paths. Notice that by this construction, can have at most vertices and edges.

Let , and . From a graph object as shown above, we observe that edge-disjoint interesting -paths can be chosen by -IPD each with interestingness score if and only if the -, -, -paths are chosen along with other -paths as shown in Figure 5. Further, each edge corresponding to an element in may belong to only one -path. Thus, at most such graph objects may contribute the score of each to the total interestingness score. The remaining graph objects may contribute a score of at most each corresponding to the selection of the edge-disjoint interesting -paths shown in Figure 5, which avoid the edges corresponding to any . If such graph objects do contribute each to the total interestingness score, it is clear that the corresponding -tuple elements in form an exact -cover of . Further, -IPD on will identify exactly edge-disjoint interesting -paths with a total interestingness score of exactly .

Conversely, if XC has an exact -cover , we choose the edge-disjoint interesting -paths (again, we assume identical signatures for all edges in ) each with interestingness score as described above in the graph object in corresponding to each of the -element sets in . For the -element sets in , we choose the edge-disjoint interesting -paths each with interestingness score in the corresponding graph objects in . This collection of edge-disjoint interesting -paths in will have a total interestingness score of exactly .

Thus has an exact -cover if and only if -IPD on has a target total interestingness score of , proving -IPD is NP-complete for . ∎

### 4.2 The k-IP problem on directed acyclic graphs

We saw (in Section 3.2) that Max-IP is polynomial time solvable over DAGs even though the problem is NP-complete in general. However, we do not get an analogous result for -IP. We will use a modification of the construction used in Theorem 4.2 to show that -IPD is NP-complete over DAGs as well.

###### Lemma 4.3.

-IPD on a directed acyclic graph is NP-complete for .

###### Proof.

We follow the same arguments used in the Proof of Theorem 4.2, and first reduce XC to -IPD on a DAG. Here we construct an acyclic graph object corresponding to the -element set by removing the cyclic pairs of edges along with the -paths at their bottom connecting the three “V”-shaped interesting -paths in the top layer and the interesting -path in the base layer of Figure 2. The choices of -paths in these objects corresponding to sets in and in are shown in Figures 6 and 6, respectively. The result that -IPD on a DAG is NP-complete follows with , and .

We extend the result to by reducing XC to -IPD on a DAG. The directed acyclic graph object corresponding to each -element set is constructed now by removing the cyclic pair of edges along with the -paths at their bottom connecting the “V”-shaped interesting -paths at the top and the interesting -path in the base layer of Figure 4. The result that -IPD on a DAG is NP-complete for follows with , , and . ∎

## 5 The Interesting Paths (IP) Problem

The goal of IP is to find a set of edge-disjoint interesting paths of possibly varying lengths ( to ) which cover all the edges, such that the sum of their interestingness scores is maximized. Based on the hardness results for k-IP (Section 4), we conjecture that IP is also intractable. We develop an efficient heuristic for IP on DAGs by employing the exact algorithm for Max-IP on DAGs (in Section 3.2) as a subroutine. We also estimate lower and upper bounds on the maximum total interestingness score of IP (Section 5.2).

### 5.1 An efficient heuristic for IP on DAGs

We present a polynomial time heuristic to find a set of edge-disjoint interesting paths in a DAG with high total interestingness score. We do not provide any (approximation) guarantee on the optimality or quality of the collection of interesting paths .

Our method, termed Algorithm 1, uses a greedy strategy by iteratively calling the exact algorithm for Max-IP (Section 3.2). The idea is to iteratively detect a maximum interesting path, add it to the working set of solutions, remove all the edges in that path, and re-solve Max-IP on the remaining graph, until there are no more edges left.

Complexity Analysis: The runtime to compute Max-IP on in the first step is , as described in the proof of Lemma 3.1. Therefore, if we let denote the number of iterations (hence the number of interesting paths found), then the overall runtime complexity is . However, we expect the algorithm to run much faster in practice. Note that at least one edge is, and at most edges are, eliminated in each iteration, thereby implying here.

Consider the worst case of elimination where one edge is eliminated in each iteration, i.e., ). The graph must be very sparse in this case, i.e., , causing our algorithm for Max-IP to perform only work per iteration. Therefore the overall runtime is , or equivalently, .

On the other hand, consider the case where around edges are eliminated in every iteration for some constant . This setting implies , while the work performed from one iteration to the next will continue to reduce by a factor of . Hence the overall runtime can still be bounded by , the cost of Max-IP. Further, from an application standpoint, such a greedy iterative approach can be terminated whenever an adequate number of “top” interesting paths are identified.
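The greedy strategy of Algorithm 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the rank function `rank_fn` is assumed to be log(1 + i), the edge representation `{edge_id: (tail, head, weight, signature)}` is hypothetical, and `max_ip_dag` is a simple dynamic program in the spirit of the Max-IP algorithm of Section 3.2 (best score of an interesting path with j edges ending at each edge).

```python
import math
from collections import defaultdict, deque

def rank_fn(i):
    """Assumed nonlinear rank function; log(1 + i) is an illustrative choice."""
    return math.log(1 + i)

def max_ip_dag(edges):
    """edges: {edge_id: (tail, head, weight, signature)} describing a DAG.
    Returns (score, [edge ids in path order]) for a best-scoring interesting path."""
    out_edges, in_edges, indeg = defaultdict(list), defaultdict(list), defaultdict(int)
    vertices = set()
    for eid, (u, v, _, _) in edges.items():
        out_edges[u].append(eid)
        in_edges[v].append(eid)
        indeg[v] += 1
        vertices.update((u, v))
    # Kahn-style traversal: a vertex is dequeued only after all its in-edges are scored.
    queue = deque(x for x in vertices if indeg[x] == 0)
    table, best = {}, None  # table[eid][j] = (best score with j edges ending at eid, previous edge)
    while queue:
        u = queue.popleft()
        for eid in out_edges[u]:
            _, v, w, sig = edges[eid]
            row = {1: (w * rank_fn(1), None)}
            for pid in in_edges[u]:
                if edges[pid][3] != sig:  # only extend paths with the same signature
                    continue
                for j, (score, _) in table[pid].items():
                    cand = score + w * rank_fn(j + 1)
                    if j + 1 not in row or cand > row[j + 1][0]:
                        row[j + 1] = (cand, pid)
            table[eid] = row
            for j, (score, _) in row.items():
                if best is None or score > best[0]:
                    best = (score, eid, j)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if best is None:
        return 0.0, []
    score, eid, j = best
    path = []
    while eid is not None:  # backtrack through the stored predecessor edges
        path.append(eid)
        eid, j = table[eid][j][1], j - 1
    return score, path[::-1]

def greedy_ip(edges):
    """Repeatedly extract a maximum interesting path and delete its edges."""
    remaining, paths = dict(edges), []
    while remaining:
        score, path = max_ip_dag(remaining)
        if not path:
            raise ValueError("input graph is not acyclic")
        paths.append((score, path))
        for eid in path:
            del remaining[eid]
    return paths

# Toy DAG: edges a, b, d share signature "s"; edge c has signature "t".
example = {
    "a": (0, 1, 1.0, "s"),
    "b": (1, 2, 2.0, "s"),
    "c": (2, 3, 1.0, "t"),
    "d": (1, 3, 0.5, "s"),
}
result = greedy_ip(example)
print([path for _, path in result])  # [['a', 'b'], ['c'], ['d']]
```

The loop can be stopped early once enough "top" paths have been collected, which matches the application-driven termination mentioned above.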

#### An efficient heuristic for k-IP on DAGs

The above heuristic for IP can be easily modified to devise a heuristic for k-IP on DAGs. Algorithm 2 summarizes the main steps. The main idea is to modify the exact algorithm for Max-IP on a DAG such that it initializes a recurrence table of size , and then uses that table to iteratively compute Max-IP paths. The only constraint here is that each such Max-IP path should originate from the k-th column during backtracking, so that the output paths are guaranteed to be of length k. The runtime is bounded by .

### 5.2 Bounds for IP

Let represent an optimal set of paths for an instance of IP. We derive upper and lower bounds on its total interestingness score . Let denote a maximum interesting path (of arbitrary length) ending at an arbitrary edge .

.

###### Proof.

Consider an arbitrary path . We first note that individual paths that are members of an optimal solution () for the IP problem can end at any arbitrary non-source vertex in (see Figure 7 for an illustration).

Without loss of generality, let us assume that the input graph contains only vertices with degree at least one (as vertices with degree zero cannot contribute to any interesting path). We consider two sub-cases:

Case A: No two maximum scoring paths ending at two different edges and intersect, i.e., , and .
This case can occur only if the number of edges () is equal to the number of source vertices. This setting implies that is comprised of paths, where each path is a unique edge . Therefore, in this case.

Case B: There exist at least two maximum interesting paths ending at two different edges and that intersect, i.e., .
This case implies that at least one of these two paths is not a member of (by definition of the IP problem); let this non-member path be without loss of generality. Since all edges are covered by by definition of IP, there still has to exist an alternative path ending in that is either directly contained in or contained as a subpath of a longer path in ; let us refer to this alternative path as . Since is an optimal interesting path ending at edge , . In other words, the contribution of to cannot exceed the contribution of to . Therefore, in this case as well. ∎

We now present a lower bound for .

.

###### Proof.

Since covers all edges in , a trivial (albeit not necessarily optimal) solution for IP can be constructed by including every edge as a distinct interesting path in the graph, i.e., . Therefore, follows from Equation (1). ∎
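The two bounds of this section can be summarized in one chain of inequalities. The notation here is assumed for illustration only: Equation (1) is taken to score a path $P = (e_1, \dots, e_r)$ as $I(P) = \sum_{i=1}^{r} w(e_i)\, f(i)$ for a rank function $f$, $\mathcal{P}^*$ denotes an optimal set of paths, and $P_{\max}(e)$ denotes a maximum interesting path ending at edge $e$.

```latex
\sum_{e \in E} w(e)\, f(1)
\;\le\;
I(\mathcal{P}^*) \;=\; \sum_{P \in \mathcal{P}^*} I(P)
\;\le\;
\sum_{e \in E} I\bigl(P_{\max}(e)\bigr)
```

The left-hand side is the score of the trivial solution in which every edge is its own path of length one, and the right-hand side sums, over all edges, the best score achievable by any interesting path ending at that edge.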

## 6 Discussion

We have proposed a general framework for quantifying the significance of features in the Mapper in terms of their interestingness scores. The associated optimization problems Max-IP, -IP, and IP constitute a new class of nonlinear longest path problems on directed graphs. We have not characterized the complexity of problem IP. Judging from the fact that -IP is NP-complete (even on DAGs), we suspect IP is NP-complete as well.

Our framework for quantifying interesting paths could be modified to quantify branches in flares as well as holes. When the graph is a DAG, two interesting -paths that start and end at the same pair of vertices could be characterized as a -hole, i.e., a cycle with edges. One could alternatively use persistent homology tools to characterize holes—by identifying “long” generators around them. A 2-way branching in flares could be identified by two interesting -paths where one path starts off from a vertex at or near the start of the other path. Alternatively, we could generalize the definition of interestingness score for a path in Equation (1) to that of a 2-way branch. Subsequently, we could seek to solve the related optimization problems of identifying the most interesting -way flare, or to identify a collection of -way flares whose total interestingness score is maximized.

While we distinguished the clustering function from the filter functions for , this distinction is not critically used in our framework. As indicated in Remark 2.6, one could use simultaneously as a filter function along with the ’s, and the overall analysis should still carry through. More generally, the details of how to implement clustering within the Mapper framework have not received much research attention. In initial work on phenomics [10], we obtained better results when using alone to cluster within Mapper (rather than clustering using several, or even all, of the variables). It would be interesting to characterize the stability of Mapper to varying settings of clustering employed in its construction. For instance, could we identify a “small” subset of variables for use in clustering within Mapper that is optimal in a suitable sense?

While we have proposed an efficient heuristic for IP on DAGs, we are not able to certify the quality of the solution obtained by this method. On the other hand, could we devise approximation algorithms for IP or k-IP? One might have to work under some simplifying assumptions on the distribution of weights or on the structure of the graph . The simplest case to consider appears to be that of IP on a DAG with unit weights on all edges.

We study interestingness of features in a given single Mapper. A natural extension to consider would be to characterize the stability of the highly interesting features. Could we incorporate our interestingness scores into the mathematical machinery recently developed to obtain results on stability and statistical convergence of the 1-D Mapper [2, 3]?

Acknowledgments: This research is supported by the NSF grant DBI-1661348. Krishnamoorthy thanks Frédéric Meunier for discussion on the complexity of Max-IP while visiting MSRI.

### Footnotes

1. School of Electrical Engineering and Computer Science, Pullman, WA, 99164, USA; ananth@eecs.wsu.edu
2. School of Electrical Engineering and Computer Science, Pullman, WA, 99164, USA; methun@eecs.wsu.edu
3. Department of Mathematics and Statistics, Vancouver, WA, 98686, USA; bkrishna@math.wsu.edu
4. All references to a “path” in this paper imply a simple path, i.e., no vertices are repeated.

### References

1. Muthu Alagappan. From 5 to 13: Redefining the positions in basketball. In MIT Sloan Sports Analytics Conference, 2012.
2. Mathieu Carrière, Bertrand Michel, and Steve Oudot. Statistical analysis and parameter selection for mapper. 2017.
3. Mathieu Carrière and Steve Oudot. Structure and stability of the one-dimensional mapper. Foundations of Computational Mathematics, Oct 2017.
4. Tamal K. Dey, Facundo Mémoli, and Yusu Wang. Multiscale mapper: Topological summarization via codomain covers. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’16, pages 997–1013, Philadelphia, PA, USA, 2016. Society for Industrial and Applied Mathematics.
5. Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification. Discrete and Computational Geometry, 28:511–533, 2002.
6. Jack R. Edmonds. Maximum matching and a polyhedron with 0,1-vertices. Journal of Research of the National Bureau of Standards Section B, 69:125–130, 1965.
7. Harold N. Gabow. Data structures for weighted matching and nearest common ancestors with linking. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’90, pages 434–443, Philadelphia, PA, USA, 1990. Society for Industrial and Applied Mathematics.
8. Geoffrey M. Guisewite and Panos M. Pardalos. Algorithms for the single-source uncapacitated minimum concave-cost network flow problem. Journal of Global Optimization, 1(3):245–265, 1991.
9. Timothy S.C. Hinks, Xiaoying Zhou, Karl J. Staples, Borislav D. Dimitrov, Alexander Manta, Tanya Petrossian, Pek Y. Lum, Caroline G. Smith, Jon A. Ward, Peter H. Howarth, Andrew F. Walls, Stephan D. Gadola, and Ratko Djukanović. Innate and adaptive T cells in asthmatic patients: Relationship to severity and disease mechanisms. Journal of Allergy and Clinical Immunology, 136(2):323–333, 2015.
10. Methun Kamruzzaman, Ananth Kalyanaraman, Bala Krishnamoorthy, and Patrick Schnable. Toward a scalable exploratory framework for complex high-dimensional phenomics data. 2017. Under review; arXiv:1707.04362.
11. Richard M. Karp. Reducibility Among Combinatorial Problems. In R. E. Miller and J. W. Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum Press, 1972.
12. Li Li, Wei-Yi Cheng, Benjamin S. Glicksberg, Omri Gottesman, Ronald Tamler, Rong Chen, Erwin P. Bottinger, and Joel T. Dudley. Identification of type 2 diabetes subgroups through topological analysis of patient similarity. Science Translational Medicine, 7(311):311ra174–311ra174, 2015.
13. Pek Y. Lum, Gurjeet Singh, Alan Lehman, Tigran Ishkanov, Mikael Vejdemo-Johansson, Muthu Alagappan, John G. Carlsson, and Gunnar Carlsson. Extracting insights from the shape of complex data using topology. Scientific Reports, 3(1236), 2013.
14. Monica Nicolau, Arnold J. Levine, and Gunnar Carlsson. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences, 108(17):7265–7270, 2011.
15. Jessica L. Nielson, Jesse Paquette, Aiwen W. Liu, Cristian F. Guandique, C. Amy Tovar, Tomoo Inoue, Karen-Amanda Irvine, John C. Gensel, Jennifer Kloke, Tanya C. Petrossian, Pek Y. Lum, Gunnar E. Carlsson, Geoffrey T. Manley, Wise Young, Michael S. Beattie, Jacqueline C. Bresnahan, and Adam R. Ferguson. Topological data analysis for discovery in preclinical spinal cord injury and traumatic brain injury. Nature Communications, 6:8581+, October 2015.
16. M. Rucco, E. Merelli, D. Herman, D. Ramanan, T. Petrossian, L. Falsetti, C. Nitti, and A. Salvi. Using topological data analysis for diagnosis pulmonary embolism. Journal of Theoretical and Applied Computer Science, 9:41–55, 2015.
17. Ghanashyam Sarikonda, Jeremy Pettus, Sonal Phatak, Sowbarnika Sachithanantham, Jacqueline F. Miller, Johnna D. Wesley, Eithon Cadag, Ji Chae, Lakshmi Ganesan, Ronna Mallios, Steve Edelman, Bjoern Peters, and Matthias von Herrath. CD8 T-cell reactivity to islet antigens is unique to type 1 while CD4 T-cell reactivity exists in both type 1 and type 2 diabetes. Journal of Autoimmunity, 50(Supplement C):77–82, 2014.
18. Gurjeet Singh, Facundo Memoli, and Gunnar Carlsson. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. In M. Botsch, R. Pajarola, B. Chen, and M. Zwicker, editors, Proceedings of the Symposium on Point Based Graphics, pages 91–100, Prague, Czech Republic, 2007. Eurographics Association.
19. Brenda Y. Torres, Jose Henrique M. Oliveira, Ann Thomas Tate, Poonam Rath, Katherine Cumnock, and David S. Schneider. Tracking resilience to infections by mapping disease space. PLoS Biol, 14(4):1–19, 04 2016.
20. George Tsaggouris and Christos Zaroliagis. Non-additive Shortest Paths, pages 822–834. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
21. Hoang Tuy, Saied Ghannadan, Athanasios Migdalas, and Peter Värbrand. The minimum concave cost network flow problem with fixed numbers of sources and nonlinear arc costs. Journal of Global Optimization, 6(2):135–151, Mar 1995.