Deep Graph Mapper


Recent advancements in graph representation learning have led to the emergence of condensed encodings that capture the main properties of a graph. However, even though these abstract representations are powerful for downstream tasks, they are not equally suitable for visualisation purposes. In this work, we merge Mapper, an algorithm from the field of Topological Data Analysis (TDA), with the expressive power of Graph Neural Networks (GNNs) to produce hierarchical, topologically-grounded visualisations of graphs. These visualisations not only help discern the structure of complex graphs but also provide a means of understanding the models applied to them for solving various tasks. We further demonstrate the suitability of Mapper as a topological framework for graph pooling by mathematically proving an equivalence with minCUT and DiffPool. Building upon this framework, we introduce a novel pooling algorithm based on PageRank, which obtains competitive results with state-of-the-art methods on graph classification benchmarks.


1 Introduction

Tasks involving graph-structured data have received much attention lately, due to the abundance of relational information in the real world. Considerable progress has been made in the field of graph representation learning through generating graph encodings with the help of deep learning techniques. The abstract representations obtained by these models are typically intended for further processing within downstream tasks. However, few of these advancements have been directed towards visualising and aiding the human understanding of the complex networks ingested by machine learning models. We believe that data and model visualisation are important steps of the statistical modelling process and deserve an increased level of attention.

Figure 1: A Deep Graph Mapper (DGM) visualisation of a dense graph containing spammers and non-spammers (top-left). DGM removes the visual clutter in the original NetworkX plot (using a Graphviz ‘spring’ layout), by providing an informative summary of the graph. Each node in the DGM graph represents a cluster of nodes in the original graph. The size of the DGM nodes is proportional to the number of nodes in the corresponding cluster. Each edge signifies that two clusters have overlapping nodes proportional to the thickness of the edge. The clusters are mainly determined by a neural ‘lens’ function: a GCN performing binary node classification. The colorbar indicates the GCN predicted probability that a node is a spammer. The DGM visualisation illustrates important features of the graph: spammers (red) are highly inter-connected and consequently grouped in just a few large clusters, whereas non-spammers (blue) are less connected to the rest of the graph and thus form many small clusters.

Here, we tackle this problem by merging Mapper (Singh et al., 2007), an algorithm from the field of Topological Data Analysis (TDA) (Chazal and Michel, 2017), with the demonstrated representational power of Graph Neural Networks (GNNs) (Scarselli et al., 2008; Battaglia et al., 2018; Bronstein et al., 2017) and refer to this synthesis as Deep Graph Mapper (DGM). Our method offers a means to visualise graphs and the complex data living on them through a GNN ‘lens’. Moreover, the aspects highlighted by the visualisation can be flexibly adjusted via the loss function of the network. Finally, DGM achieves progress with respect to GNN interpretability, providing a way to visualise the model and identify the mistakes it makes for node-level supervised and unsupervised learning tasks.

Figure 2: A cartoon illustration of the Deep Graph Mapper (DGM) algorithm where, for simplicity, the GNN approximates a ‘height’ function over the nodes in the plane of the diagram. The input graph (a) is passed through a Graph Neural Network (GNN), which maps the vertices of the graph to a real number (the height) (b). Given a cover of the image of the GNN (c), the refined pull back cover is computed (d–e). The 1-skeleton of the nerve of the pull back cover provides the visual summary of the graph (f). The diagram is inspired by Hajij et al. (2018a).

We then demonstrate that Mapper graph summaries are not only suitable for visualisation, but can also constitute a pooling mechanism within GNNs. We begin by proving that Mapper is a generalisation of pooling methods based on soft cluster assignments, which include state-of-the-art algorithms like minCUT (Bianchi et al., 2019) and DiffPool (Ying et al., 2018). Building upon this topological perspective, we propose MPR, a novel graph pooling algorithm based on PageRank (Page et al., 1999). Our method obtains competitive or superior results when compared with state-of-the-art pooling methods on graph classification benchmarks. To summarise, our contributions are threefold:

  • DGM, a topologically-grounded method for visualising graphs and the GNNs applied to them.

  • A proof that Mapper is a generalisation of soft cluster assignment pooling methods, including the state-of-the-art minCUT and DiffPool.

  • MPR, a Mapper-based pooling method that achieves similar or superior results compared to state-of-the-art methods on several graph classification benchmarks.

2 Related Work

2.1 Graph Visualisation and Interpretability

A number of software tools exist for visualising node-link diagrams: NetworkX (Hagberg et al., 2008), Gephi (Bastian et al., 2009), Graphviz (Gansner and North, 2000) and NodeXL (Smith et al., 2010). However, these tools do not attempt to produce condensed summaries and consequently suffer on large graphs from the visual clutter problem illustrated in Figure 1. This makes the interpretation and understanding of the graph difficult. Some of the earlier attempts to produce visual summaries rely on grouping nodes into a set of predefined motifs (Dunne and Shneiderman, 2013) or compressing them in a lossless manner into modules (Dwyer et al., 2013). However, these mechanisms are severely constrained by the simple types of node groupings that they allow.

Mapper-based summaries for graphs have recently been considered by Hajij et al. (2018a). However, despite the advantages provided by Mapper, their approach relies on hand-crafted graph-theoretic ‘lenses’, such as the average geodesic distance, graph density functions or eigenvectors of the graph Laplacian. Not only are these methods rigid and unable to adapt well to the graph or task of interest, but they are also computationally inefficient. Moreover, they do not take into account the features of the graph. In this paper, we build upon their work by considering learnable functions (GNNs) that do not present these problems.

Mapper visualisations are also an indirect way to analyse the behaviour of the associated ‘lens’ function. However, visualisations that are directly oriented towards model interpretability have been recently considered by Ying et al. (2019), who propose a model capable of indicating the relevant sub-graphs and features for a given model prediction.

2.2 Graph Pooling

Pooling algorithms have already been considerably explored within GNN frameworks for graph classification. Luzhnica et al. (2019a) propose a topological approach to pooling which coarsens the graph by aggregating its maximal cliques into new clusters. However, cliques are local topological features, whereas our MPR algorithm leverages a global perspective of the graph during pooling. Two paradigms distinguish themselves among learnable pooling layers: top-k pooling based on a learnable ranking, initially adopted by Gao and Ji (2019) (Graph U-Nets), and learning the cluster assignment (Ying et al., 2018) with additional entropy and link prediction losses for more stable training (DiffPool). Following these two trends, several variants and incremental improvements have been proposed. The top-k approach is explored in conjunction with jumping-knowledge networks (Cangea et al., 2018), attention (Lee et al., 2019; Huang et al., 2019) and self-attention for cluster assignment (Ranjan et al., 2019). Similarly to DiffPool, the method suggested by Bianchi et al. (2019) uses several loss terms to enforce clusters with strongly connected nodes, similar sizes and orthogonal assignments. A different approach is also proposed by Ma et al. (2019), who leverage spectral clustering for pooling.

2.3 Topological Data Analysis in Machine Learning

Persistent homology (Edelsbrunner and Harer, 2008) has so far been the most popular branch of TDA applied to machine learning, especially on graphs. Hofer et al. (2017) and Carriere et al. (2019) integrated graph persistence diagrams with neural networks to obtain topology-aware models. A more advanced approach for GNNs has been proposed by Hofer et al. (2019), who backpropagate through the persistent homology computation to directly learn a graph filtration.

Mapper, another central algorithm in TDA, has been used in deep learning almost exclusively as a tool for understanding neural networks. Gabella et al. (2019) use Mapper to visualise the evolution of the weights of fully connected networks during training, while Gabrielsson and Carlsson (2018) use it to visualise the filters computed by CNNs. To the best of our knowledge, the paper of Hajij et al. (2018a) remains the only application of Mapper on graphs.

3 Mapper for Visualisations

In this section we describe the proposed integration between Mapper and GCNs.

3.1 Mapper on Graphs

We start by reviewing Mapper (Singh et al., 2007), a topologically-motivated algorithm for high-dimensional data visualisation. Intuitively, Mapper obtains a low-dimensional image of the data that can be easily visualised. The algorithm produces an output graph that shows how clusters within the data are semantically related from the perspective of a ‘lens’ function. The resulting graph preserves the notion of ‘nearness’ of the input topological space, but it can compress large scale distances by connecting far-away points that are similar according to the lens. We first introduce the required mathematical background.

Definition 3.1.

An open cover of a topological space X is a collection of open sets (U_i)_{i∈I}, for some indexing set I, whose union includes X.

For example, an open cover for the real numbers could be {(−∞, 1), (0, +∞)}. Similarly, {{v1, v2}, {v2, v3}} is an open cover for a set of vertices {v1, v2, v3}.

Definition 3.2.

Let X, Y be topological spaces, f : X → Y a continuous function, and U = (U_i)_{i∈I} a cover of Y. Then, the pull back cover f*(U) of X induced by (f, U) is the collection of open sets f⁻¹(U_i), for i in the indexing set I, where by f⁻¹(U_i) we denote the preimage of the set U_i.

Given a dataset X, a carefully chosen lens function f : X → Y and a cover U of Y, Mapper first computes the associated pull back cover f*(U). Then, using a clustering algorithm of choice, it clusters each of the open sets f⁻¹(U_i) in f*(U). The resulting group of sets is called the refined pull back cover, denoted by (C_j)_{j∈J} with indexing set J. Concretely, in this paper, the input dataset is a weighted graph X = (V, E) and f is a function over the vertices of the graph. An illustration for these steps is provided in Figure 2 (a–e) for a ‘height’ function f.

Finally, Mapper produces a new graph by taking the 1-skeleton of the nerve of the refined pull back cover: a graph where the vertices are given by (v_j)_{j∈J} and two vertices v_{j1}, v_{j2} are connected if and only if C_{j1} ∩ C_{j2} ≠ ∅. Informally, the soft clusters formed by the refined pull back become the new nodes, and clusters with common nodes are connected by an edge. This final step is illustrated in Figure 2 (f).

Three main degrees of freedom that determine the visualisation can be distinguished within Mapper:

The lens f: In our case, the lens is a function over the vertices, which acts as a filter that emphasises certain features of the graph. The choice of f highly depends on the properties to be highlighted by the visualisation. We also refer to the co-domain of f as the parametrisation space.

The cover U: The choice of cover determines the resolution of the summary. Fine-grained covers will produce more detailed visualisations, while higher overlaps between the sets in the cover will increase the connectivity of the output graph. When the image of f lies in ℝ, a common choice is to select a set of overlapping intervals over the real line, as in Figure 2 (c).

Clustering algorithm: Mapper works with any clustering algorithm. When the dataset is a graph, a natural choice adopted by Hajij et al. (2018a) is to take the connected components of the subgraphs induced by the vertices in each open set of the pull back cover (Figure 2 (e–f)). This is also the approach we follow, unless otherwise stated.
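The full procedure above can be condensed into a short sketch. The snippet below is our illustrative implementation (not the authors' code) of Mapper on graphs with a one-dimensional lens, an overlapping-interval cover and connected-component clustering; `interval_cover` and `graph_mapper` are hypothetical helper names.

```python
import itertools
import networkx as nx

def interval_cover(lo, hi, n, overlap):
    # n equally-sized intervals spanning [lo, hi], each enlarged by
    # `overlap` (a fraction of the interval length) at both ends
    length = (hi - lo) / n
    eps = length * overlap
    return [(lo + i * length - eps, lo + (i + 1) * length + eps)
            for i in range(n)]

def graph_mapper(graph, lens, n=4, overlap=0.25):
    # 1. pull the cover back through the lens
    values = {v: lens(v) for v in graph}
    lo, hi = min(values.values()), max(values.values())
    clusters = []
    for a, b in interval_cover(lo, hi, n, overlap):
        preimage = [v for v, y in values.items() if a <= y <= b]
        # 2. refine: cluster each preimage into connected components
        for comp in nx.connected_components(graph.subgraph(preimage)):
            clusters.append(frozenset(comp))
    # 3. 1-skeleton of the nerve: clusters sharing nodes get an edge
    summary = nx.Graph()
    summary.add_nodes_from(range(len(clusters)))
    for (i, ci), (j, cj) in itertools.combinations(enumerate(clusters), 2):
        if ci & cj:
            summary.add_edge(i, j)
    return summary, clusters
```

For instance, on a 10-node path graph with the node index as the lens, this produces a small path-shaped summary graph of four clusters.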

3.2 Seeing through the Lens of GCNs

As mentioned in Section 2.1, Hajij et al. (2018b) have considered a set of graph-theoretic functions for the lens. However, with the exception of PageRank, all of these functions are difficult to compute on large graphs. The average geodesic distance and the graph density function require computing the distance matrix of the graph, while the eigenfunction computations do not scale beyond graphs with a few thousand nodes. Besides their computational complexity, many real-world graphs contain features within the nodes and edges of the graphs, which are ignored by graph-theoretic summaries.

In this work, we leverage the recent progress in the field of graph representation learning and propose a series of lens functions based on Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016). We refer to this integration between Mapper and GCNs as Deep Graph Mapper (DGM).

Unlike graph-theoretic functions, GCNs can naturally learn and integrate the features associated with the graph and its topological properties, while also scaling to large, complex graphs. Additionally, visualisations can flexibly be tuned for the task of interest by adjusting the associated loss function.

Mapper further constitutes a method for implicitly visualising the lens function, so DGM is also a novel approach to model understanding. Figure 4 illustrates how Mapper can be used to identify mistakes that the underlying GCN model makes in node classification. This showcases the potential of DGM for continuous model refinement.

3.3 Supervised Lens

A natural application for DGM in the supervised domain is as an assistive tool for binary node classification. The node classifier, a function f : V → [0, 1] giving the predicted probability of the positive class, is an immediate candidate for the DGM lens. The usual choice for a cover over [0, 1] is a set of n equally-sized overlapping intervals, with an overlap percentage of g. This often produces a hierarchical (tree-like) perspective over the input graph, as shown in Figure 1.

Figure 3: DGM visualisation of the Cora dataset, using a GCN classifier as the lens and a grid-cover with 9 cells and 10% overlap across each axis. The two-dimensional cover is able to capture more complex relations of semantic proximity, such as high-order cliques. We use a two-dimensional colormap (bottom-right) to reflect the relationship between the nodes and their position in the parametrisation space.

However, most node-classification tasks involve more than two labels. A simple solution in this case is to use a dimensionality reduction algorithm such as t-SNE (van der Maaten and Hinton, 2008) to embed the logits in a low-dimensional space. Empirically, when the number of classes is larger than two, we find a 2D parametrisation space to better capture the relationships between the classes in the semantic space. Figure 3 includes a visualisation of the Cora dataset using a t-SNE embedding of the logits. Faster dimensionality reduction methods such as PCA (Jolliffe, 2002) or the recently proposed NCVis (Artemenkov and Panov, 2020) could be used to scale this approach to large graphs.

(a) DGM visualisation.
(b) DGM visualisation with ground-truth colouring.
Figure 4: Side-by-side comparison of the DGM visualisation and the one with ground-truth labelling. The two images clearly highlight certain mistakes that the classifier is making, such as the many small nodes in the bottom-left corner that the classifier tends to label as spam (light-red), even though most of them are non-spam users (in blue).

In a supervised setting, DGM visualisations can also be integrated with the ground-truth labels of the nodes. The latter provide a means for visualising both the mistakes that the classifier is making and the relationships between classes. For lenses trained to perform binary classification of the nodes, we colour the nodes of the summary graph proportionally to the number of nodes in the corresponding cluster of the original graph that belong to the positive class. For lenses that are multi-label classifiers, we colour each node with the most frequent class in its corresponding cluster. Figure 4 gives the labelled summary for the binary Spam dataset, while Figure 7 includes two labelled visualisations for the Cora and CiteSeer datasets.

3.4 Unsupervised Lens

The expressive power of GNNs is not limited to supervised tasks. In fact, many graphs do not have any labels associated with their nodes—in this case, the lenses described so far, which require supervised training, could not be applied. However, the recent models for unsupervised graph representation learning constitute a powerful alternative.

Here, we use Deep Graph Infomax (DGI) (Veličković et al., 2018) to compute node embeddings and obtain a low-dimensional parametrisation of the graph nodes. DGI computes node-level embeddings by learning to maximise the mutual information (MI) between patch representations and corresponding high-level summaries of graphs. We have empirically found that applying t-SNE over a higher-dimensional embedding of DGI works better than learning a low-dimensional parametrisation with DGI directly. Figure 7 includes two labelled visualisations obtained with the unsupervised DGI lens on Cora and CiteSeer.

Figure 5: By adjusting the cover U, multi-resolution visualisations of the graph can be obtained. The multi-resolution perspective can help one identify the persistent features of the graph (i.e. the features that survive at multiple scales). For instance, the connection between the grey and blue nodes does not survive at all resolutions, whereas the connection between the red and orange nodes persists.

3.5 Hierarchical Visualisations

One of the biggest advantages of Mapper is the flexibility to adjust the resolution of the visualisation by modifying the cover U. The number of sets in U determines the coarseness of the output graph, with a larger number of sets generally producing a larger number of nodes. The overlap between these sets determines the level of connectivity between the nodes—for example, a higher degree of overlap produces denser graphs. By adjusting these two parameters, one can discover which properties of the graph are persistent, consequently surviving at multiple scales, and which features are mainly due to noise. In Figure 5, we include multiple visualisations of the CiteSeer (Sen et al., 2008) dataset at various scales—these are determined by adjusting the cover U via its cardinality n and its interval overlap percentage g.

3.6 The Dimensionality of the Output Space

Figure 6: DGM visualisation of Cora with a 3D parametrisation and ground-truth labels. Each colour represents a different class. While higher-dimensional parametrisations can encode more complex semantic relations, the interpretability of the lens becomes more difficult.

Although Mapper typically uses a one-dimensional parametrisation space, higher-dimensional ones have extra degrees of freedom in embedding the nodes inside them. Therefore, higher-dimensional spaces feature more complex neighbourhoods or relations of semantic proximity, as can be seen from Figures 1 (1D), 3 (2D) and 6 (3D).

However, the interpretability of the lens decreases as the dimension d of the parametrisation space increases, for a number of reasons. First, colormaps can be easily visualised in 1D and 2D, but it is hard to visualise a 3D colormap. Therefore, for the 3D parametrisation in Figure 6, we use the dataset labels to colour the nodes. Second, the curse of dimensionality becomes a problem, and the open sets of the cover are likely to contain fewer and fewer nodes as d increases. Therefore, the resolution of the visualisation can hardly be adjusted in higher-dimensional spaces.

4 Mapper for Pooling

We now present a theoretical result showing that the graph summaries computed by Mapper are not only useful for visualisation purposes, but also as a pooling (graph-coarsening) mechanism inside GNNs. Building on this evidence, we introduce a pooling method based on PageRank and Mapper.

4.1 Mapper and Spectral Clustering

The relationship between Mapper for graphs and spectral clustering has been observed by Hajij et al. (2018a). This link is a strong indicator that Mapper can compute ‘useful’ clusters for pooling. We formally restate this observation below and provide a short proof.

Proposition 4.1.

Let L be the Laplacian of a graph G and l₂ the eigenvector corresponding to the second-lowest eigenvalue of L, also known as the Fiedler vector (Fiedler, 1973). Then, for a lens function f(v) = l₂(v), outputting the entry of the eigenvector corresponding to node v, and a cover U = {(−∞, ε), (−ε, +∞)}, Mapper produces a spectral bi-partition of the graph for a sufficiently small positive ε.


Proof.

It is well known that the Fiedler vector can be used to obtain a “good” bi-partition of the graph based on the sign of the entries of the vector (i.e. l₂(v) > 0 and l₂(v) ≤ 0) (please refer to Demmel (1995) for a proof). Therefore, by setting ε to a sufficiently small positive number, the obtained pull back cover is a spectral bi-partition of the graph. ∎
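This construction can be checked numerically. The sketch below is our own illustration (not part of the paper's pipeline): it computes the Fiedler vector of a barbell graph and applies the two-interval cover from the proposition; the choice of graph is arbitrary.

```python
import numpy as np
import networkx as nx

# Two 5-cliques joined by a single bridge edge
g = nx.barbell_graph(5, 0)
L = nx.laplacian_matrix(g).toarray().astype(float)

# Eigenvector of the second-smallest eigenvalue (the Fiedler vector)
_, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]

# Mapper with lens f(v) = fiedler[v] and cover {(-inf, eps), (-eps, +inf)}
eps = 1e-9
part_neg = frozenset(v for v in g if fiedler[v] < eps)
part_pos = frozenset(v for v in g if fiedler[v] > -eps)
```

The two pull back sets recover exactly the two cliques, i.e. a spectral bi-partition of the graph.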

The result above indicates that Mapper is a generalisation of spectral clustering. As the latter is strongly related to min-cuts (Leskovec, 2016), the proposition also provides a link between Mapper and min-cuts.

4.2 Mapper and Soft Cluster Assignments

Let G = (V, E) be a graph with self-loops for each node and adjacency matrix A. Soft cluster assignment pooling methods use a soft cluster assignment matrix S ∈ ℝ^{N×K}, where N is the number of nodes in the graph and K is the number of clusters, and compute the new adjacency matrix of the graph via A′ = SᵀAS. Equivalently, two soft clusters become connected if and only if there is a common edge between them.

Proposition 4.2.

There exists a graph G′ derived from G, a soft cluster assignment of G′ based on S, and a cover U of the unit simplex, such that the 1-skeleton of the nerve of the pull back cover induced by this assignment is isomorphic with the graph defined by A′ = SᵀAS.

This shows that Mapper is a generalisation of soft-cluster assignment methods. A detailed proof and a diagrammatic version of it can be found in the supplementary material. Note that this result implicitly uses an instance of Mapper with the simplest possible clustering algorithm that assigns all vertices in each open set of the pull back cover to the same cluster (same as no clustering).

We hope that this result will enable theoreticians to study pooling operators through the topological and statistical properties of Mapper (Carriere et al., 2018). At the same time, we encourage practitioners to take advantage of it and design new pooling methods in terms of a well-chosen lens function and cover for its image.

A remaining challenge is designing differentiable pull back cover operations to automatically learn a parametric lens and cover. We leave this exciting direction for future work and focus on exploiting the proven connection—we propose in the next section a simple, non-differentiable pooling method based on PageRank and Mapper that performs surprisingly well.

Additionally, we use this result to propose Structural DGM (SDGM), a version of DGM where edges represent structural connections between the clusters in the refined pull back cover, rather than semantic connections. SDGM can be found in Appendix B.

4.3 PageRank Pooling


For the graph classification task, each example G is represented by a tuple (X, A), where X is the node feature matrix and A is the adjacency matrix. Both our graph embedding and classification networks consist of a sequence of graph convolutional layers (Kipf and Welling, 2016); the l-th layer operates on its input feature matrix X_l as follows:

X_{l+1} = σ(D̂^{-1/2} Â D̂^{-1/2} X_l W_l),

where Â = A + I is the adjacency matrix with self-loops, D̂ is the normalised node degree matrix of Â and σ is the activation function.

After L layers, the embedding network simply outputs node features X_L, which are subsequently processed by a pooling layer to coarsen the graph. The classification network first takes as input the node features of the Mapper-pooled graph (see the MPR pooling procedure below), X_MPR, and passes them through L graph convolutional layers. Following this, the network computes a graph summary given by the feature-wise node average and applies a final linear layer which predicts the class:

y = softmax((1/N) Σ_{i=1}^{N} X_i W),

where N is the number of nodes in the final graph.
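As a sketch, the two equations above can be written in numpy as follows; this is a toy illustration with externally supplied weights, not the training code, and the function names are our own.

```python
import numpy as np

def gcn_layer(A, X, W, act=np.tanh):
    # X_{l+1} = act(D^{-1/2} (A + I) D^{-1/2} X_l W_l)
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # normalised degrees
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return act(A_norm @ X @ W)

def classify(X, W):
    # feature-wise node average followed by a linear softmax classifier
    logits = X.mean(axis=0) @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

The readout averages over however many nodes the pooled graph has, so the classifier is independent of the graph size.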

(a) DGM on Cora
(b) Density Mapper on Cora
(c) DGI t-SNE on Cora
(d) Graphviz layout on Cora
(e) DGM on CiteSeer
(f) Density Mapper on CiteSeer
(g) DGI t-SNE on CiteSeer
(h) Graphviz layout on CiteSeer
Figure 7: Qualitative comparison between DGM (a, e) and Mapper with an RBF graph density function (b, f). The DGI t-SNE plot (c, g) and the Graphviz visualisation of the full graph (d, h) are added for reference. The first and second rows show plots for Cora and CiteSeer, respectively. The nodes are coloured with the dataset node labels. DGM with an unsupervised lens implicitly makes all dataset classes appear clearly separated in the visualisation, which does not happen in the density visualisation. DGM also adds a new layer of information relative to the t-SNE plot by mapping the semantic information back to the original graph.

Mapper-based PageRank (MPR) Pooling

We now describe the pooling mechanism used in our graph classification pipeline, which we adapt from the Mapper algorithm. The first step is to assign each node a real number in [0, 1], achieved by computing a lens function f : V → [0, 1] given by the normalised PageRank (PR) (Page et al., 1999) of the nodes. The PageRank function assigns an importance value to each of the nodes based on their connectivity, according to the following recurrence relation:

PR(i) = Σ_{j ∈ N(i)} PR(j) / |N(j)|,

where N(i) represents the set of neighbours of the i-th node in the graph. The resulting scores are values in [0, 1] which reflect the probability of a random walk through the graph ending in a given node. Computing the PR vector is achieved using NetworkX (Hagberg et al., 2008) via power iteration, as PR is the principal eigenvector of the transition matrix of the graph:

M = d T + ((1 − d)/N) E,

where T is the column-normalised transition matrix of the graph, E is a matrix with all elements equal to 1, and d ∈ [0, 1] is the probability of continuing the random walk at each step; a value of d closer to 0 implies the nodes would receive a more uniform ranking and tend to be clustered in a single node. We choose the widely-adopted d = 0.85 and refer the reader to (Boldi et al., 2005) for more details.
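A minimal power-iteration sketch of this computation is shown below; it is our own illustration (in practice we simply call NetworkX's `pagerank`) and assumes a connected graph with no dangling nodes.

```python
import numpy as np
import networkx as nx

def pagerank_power(graph, d=0.85, iters=100):
    nodes = list(graph)
    N = len(nodes)
    A = nx.to_numpy_array(graph, nodelist=nodes)
    T = A.T / A.sum(axis=1)        # column-stochastic transition matrix
    M = d * T + (1 - d) / N        # adds the uniform teleportation term E/N
    pr = np.full(N, 1.0 / N)
    for _ in range(iters):         # power iteration converges to the
        pr = M @ pr                # principal eigenvector of M
    return dict(zip(nodes, pr / pr.sum()))
```

On small graphs this agrees closely with NetworkX's built-in implementation.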

We use the previously described cover of overlapping intervals over [0, 1] and determine the pull back cover induced by f. This effectively builds a soft cluster assignment matrix S from nodes in the original graph to ones in the pooled graph:

S_{ij} = 𝟙[PR(i) ∈ I_j] / |{k : PR(i) ∈ I_k}|,

where I_j is the j-th overlapping interval in the cover of [0, 1]. It can be observed that the resulting clusters contain nodes with similar PageRank scores, as determined by the recurrence above. Therefore, our pooling method intuitively merges the (usually few) highly connected nodes in the graph, and at the same time clusters the (typically many) dangling nodes whose normalised PageRank score is closer to zero.

Finally, the mapping S is used to compute the features of the new nodes (i.e. the soft clusters formed by the pull back), X′ = SᵀX, and the corresponding adjacency matrix, A′ = SᵀAS.
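Putting the pieces together, MPR pooling can be sketched as below. This is our illustrative numpy version; in particular, the row normalisation in `mpr_assignment` is one plausible choice of soft assignment and not necessarily the paper's exact formulation.

```python
import numpy as np

def mpr_assignment(pr_scores, n=3, overlap=0.25):
    # n overlapping intervals covering [0, 1]
    length = 1.0 / n
    eps = length * overlap
    intervals = [(i * length - eps, (i + 1) * length + eps) for i in range(n)]
    # indicator of interval membership, normalised so each row sums to one
    S = np.array([[1.0 if a <= p <= b else 0.0 for a, b in intervals]
                  for p in pr_scores])
    return S / S.sum(axis=1, keepdims=True)

def mpr_pool(X, A, S):
    # features and adjacency of the pooled graph: X' = S^T X, A' = S^T A S
    return S.T @ X, S.T @ A @ S
```

Nodes whose PageRank scores fall in the overlap of two intervals are softly shared between the corresponding clusters, which is what creates edges in the pooled graph.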

It is important that graph classification models are node-permutation invariant, since one graph can be represented by any tuple (PX, PAPᵀ), where P is a node permutation matrix. Below, we state a positive result in this regard for the MPR pooling procedure.

Proposition 4.3.

The PageRank pooling operator defined above is permutation-invariant.


Proof.

First, we note that the PageRank function is permutation-invariant and refer the reader to Altman (2005, Axiom 3.1) for the proof. It then follows trivially that the PageRank pooling operator is also permutation-invariant. ∎
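As an empirical sanity check (not a substitute for the proof), one can verify with NetworkX that relabelling a graph's nodes permutes its PageRank scores accordingly; the random graph and the permutation below are arbitrary choices of ours.

```python
import networkx as nx

g = nx.erdos_renyi_graph(20, 0.3, seed=0)
perm = {v: (7 * v) % 20 for v in g}   # a bijection on {0, ..., 19}
h = nx.relabel_nodes(g, perm)

pr_g = nx.pagerank(g, alpha=0.85)
pr_h = nx.pagerank(h, alpha=0.85)
# the score of node v in g matches the score of perm[v] in h
```

Since the lens values are permuted consistently with the nodes, the pull back clusters, and hence the pooled graph, are unchanged up to relabelling.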

5 Experiments

We now provide a qualitative evaluation of the DGM visualisations and benchmark MPR pooling on a set of graph classification tasks.

5.1 Tasks

The DGM visualiser is evaluated on two popular citation networks: CiteSeer and Cora (Sen et al., 2008). We further showcase the applicability of DGM within a pooling framework, reporting its performance on a variety of settings: social (Reddit-Binary), citation networks (Collab) and chemical data (D&D, Proteins) (Kersting et al., 2016).

5.2 Qualitative Results

In this section, we qualitatively compare DGM using a DGI lens against a Mapper instance that uses a fine-tuned graph density function based on an RBF kernel over the distance matrix of the graph (Figure 7). For reference, we also include a full-graph visualisation using a Graphviz layout and a t-SNE plot of the DGI embeddings that are used as the lens.

5.3 Pooling Evaluation

We adopt a 10-fold cross-validation approach to evaluating the graph classification performance of MPR and other competitive state-of-the-art methods. The random seed was fixed for all experiments (with respect to dataset splitting, shuffling and parameter initialisation), in order to ensure a fair comparison across architectures. All models were trained on a Titan Xp GPU, using the Adam optimiser (Kingma and Ba, 2014) with early stopping on the validation set, for a maximum of 30 epochs. We report the classification accuracy using 95% confidence intervals calculated for a population size of 10 (the number of folds).


We compare the performance of MPR to two other pooling approaches with which we identify mathematical connections: minCUT (Bianchi et al., 2019) and DiffPool (Ying et al., 2018). Additionally, we include Graph U-Net (Gao and Ji, 2019) in our evaluation, as it has been shown to yield competitive results while performing pooling from the alternative perspective of a learnable node ranking; we denote this approach by Top-k in the remainder of this section.

We optimise MPR with respect to its cover cardinality n, its interval overlap percentage g at each pooling layer, its learning rate and its hidden size. The Top-k architecture is evaluated using the code provided in the official repository, where separate configurations are defined for each of the benchmarks. The minCUT architecture is represented by the sequence of operations described by Bianchi et al. (2019): MP(32)-pooling-MP(32)-pooling-MP(32)-GlobalAvgPool, followed by a linear softmax classifier. The MP(32) block represents a message-passing operation performed by a graph convolutional layer with 32 hidden units:

X_{t+1} = ReLU(Ã X_t W_m + X_t W_s),

where Ã = D^{-1/2} A D^{-1/2} is the symmetrically normalised adjacency matrix and W_m, W_s are learnable weight matrices representing the message-passing and skip-connection operations within the layer. The DiffPool model follows the same sequence of steps. Full details of the model architectures and hyperparameters can be found in the supplementary material.
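For concreteness, one MP block can be sketched in a few lines of numpy; this is our shape-level illustration with externally supplied weights, not the minCUT authors' code.

```python
import numpy as np

def mp_block(A_norm, X, W_m, W_s):
    # message passing plus a skip connection:
    # X_{t+1} = ReLU(A_norm X_t W_m + X_t W_s)
    return np.maximum(A_norm @ X @ W_m + X @ W_s, 0.0)
```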


The graph classification performance obtained by these models is reported in Table 1. By suitably adjusting the MPR hyperparameters, we achieve the best results on D&D, Proteins and Collab, and closely follow minCUT on Reddit-Binary. These results showcase the utility of Mapper for designing better pooling operators.

Table 1: Results obtained by the optimised architectures (MPR, Top-k, minCUT, DiffPool) on classification benchmarks. Accuracy measures with 95% confidence intervals are reported.

6 Conclusion and Future Work

We have introduced Deep Graph Mapper, a topologically-grounded method for producing informative hierarchical graph visualisations with the help of GNNs. We have shown these visualisations are not only helpful for understanding various graph properties, but can also aid in refining graph models. Additionally, we have proved that Mapper is a generalisation of soft cluster assignment methods, effectively providing a bridge between graph pooling and the TDA literature. Based on this connection, we have proposed a simple Mapper-based PageRank pooling operator, competitive with several state-of-the-art methods on graph classification benchmarks. Future work will focus on back-propagating through the pull back computation to automatically learn a lens and cover. Lastly, we plan to extend our methods to spatio-temporally evolving graphs.


Acknowledgements

We would like to thank Petar Veličković, Ben Day, Felix Opolka, Simeon Spasov, Alessandro Di Stefano, Duo Wang, Jacob Deasy, Ramon Viñas, Alex Dumitru and Teodora Reu for their constructive comments. We are also grateful to Teo Stoleru for helping with the diagrams.

Figure 8: An illustration of the proof for Proposition 4.2. Connecting the clusters that share an edge in the original graph (a) is equivalent to constructing an expanded graph with an expanded cluster assignment (b), performing a 1-hop expansion of the expanded cluster assignment (c), and finally taking the 1-skeleton of the nerve (d) of the cover obtained in (c).

Appendix A: Proof of Proposition 4.2

Throughout the proof, we consider a graph with a self-loop for each of its nodes. The self-loop assumption is not necessary, but it elegantly handles a number of degenerate cases involving nodes that are isolated from the other nodes in their cluster. We refer to the edges of a node which are not self-loops as external edges.

Let s : V → Δ_{K−1} be a soft cluster assignment function that maps the vertices to the (K−1)-dimensional unit simplex. We denote by s_k(v) the probability that vertex v belongs to cluster k, so that Σ_k s_k(v) = 1. This function can be completely specified by a cluster assignment matrix S with entries S_{vk} = s_k(v). This is the soft cluster assignment matrix computed by minCUT and DiffPool via a GCN.

Definition 6.1.

Let G(V, E) be a graph with a self-loop for each node. The expanded graph of G is the graph G′(V′, E′) constructed as follows. For each node v ∈ V with at least one external edge, G′ contains a clique on as many nodes as v has external edges. For each external edge (v, u) ∈ E, a pair of nodes from the corresponding cliques of v and u, neither of which yet has an edge outside its own clique, are connected. Additionally, each isolated node becomes, in the new graph, two nodes connected by an edge.

Essentially, the connections between the nodes in the original graph are replaced by connections between the newly formed cliques, such that every new node is connected to at most one node outside its clique. An example of an expanded graph is shown in Figure 8 (b). The nodes in the newly formed cliques are coloured similarly to the node of the original graph they originate from.
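The construction in Definition 6.1 can be sketched as follows; the function name `expanded_graph` and the toy star graph are our own illustration of our reading of the definition, not code from the paper:

```python
import numpy as np
from itertools import combinations

def expanded_graph(A):
    """Construct the expanded graph of a graph with self-loops.

    Each node with d external edges becomes a clique of d nodes; every
    external edge connects one still-free clique node on each side; an
    isolated node becomes two nodes joined by an edge. Returns the
    expanded adjacency matrix and, for each expanded node, the original
    node it comes from.
    """
    n = A.shape[0]
    ext = [[u for u in range(n) if u != v and A[v, u]] for v in range(n)]
    clique, origin, edges = {}, [], []
    for v in range(n):
        size = len(ext[v]) if ext[v] else 2   # isolated node -> edge pair
        clique[v] = list(range(len(origin), len(origin) + size))
        origin += [v] * size
        edges += combinations(clique[v], 2)   # intra-clique edges
    used = {v: 0 for v in range(n)}           # next free clique node per node
    for v in range(n):
        for u in ext[v]:
            if u > v:                         # handle each external edge once
                edges.append((clique[v][used[v]], clique[u][used[u]]))
                used[v] += 1
                used[u] += 1
    N = len(origin)
    A_exp = np.eye(N)                         # keep self-loops
    for a, b in edges:
        A_exp[a, b] = A_exp[b, a] = 1.0
    return A_exp, np.array(origin)

# Example: a star with centre 0 and leaves 1, 2 (self-loops included)
A = np.eye(3)
A[0, 1] = A[1, 0] = A[0, 2] = A[2, 0] = 1.0
A_exp, origin = expanded_graph(A)   # 4 expanded nodes, origins [0, 0, 1, 2]
```

In the example, the centre node (two external edges) becomes a 2-clique, each leaf becomes a single node, and every expanded node ends up with at most one edge outside its clique, as the text requires.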

Definition 6.2.

Let G(V, E) be a graph with a self-loop for each node, s a soft cluster assignment function for it, and G′(V′, E′) the expanded graph of G. Then the expanded soft cluster assignment s′ is the cluster assignment function for G′ defined by s′(v_i) = s(v) for all the nodes v_i in the corresponding clique of v.

In plain terms, all the nodes in a clique inherit, through s′, the cluster assignments of the corresponding node from the original graph. This is also illustrated by the coloured contours of the expanded graph in Figure 8 (b).

Definition 6.3.

Let S be a soft cluster assignment matrix for a graph G with adjacency matrix A. The 1-hop expansion of the assignment S with respect to the graph G is the new cluster assignment function induced by the row-normalised version of the matrix AS.

As we shall now prove, the 1-hop expansion simply extends each soft cluster from S by adding its 1-hop neighbourhood, as in Figure 8 (c).
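A small numerical example of the 1-hop expansion; the toy path graph and the hard two-cluster assignment are our own choice:

```python
import numpy as np

# Path graph 0-1-2 with self-loops, and a hard 2-cluster assignment
A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
S = np.array([[1., 0.],
              [1., 0.],
              [0., 1.]])   # nodes 0, 1 in cluster 0; node 2 in cluster 1

AS = A @ S
S_exp = AS / AS.sum(axis=1, keepdims=True)   # row-normalised 1-hop expansion
```

Here node 1 gains a positive membership in cluster 1 because it neighbours node 2, while node 0 does not, since none of its neighbours lies in cluster 1. This is exactly the behaviour stated in Lemma 6.1 below.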

Lemma 6.1.

An element of the soft cluster assignment matrix satisfies (AS)_{vk} > 0 if and only if node v is connected to a node u (possibly u = v) which is part of the soft cluster k of the assignment induced by S.


Proof. By definition, (AS)_{vk} = Σ_u A_{vu} S_{uk} = 0 if and only if A_{vu} S_{uk} = 0 for all u. This can happen if and only if, for every u, either A_{vu} = 0 (nodes v and u are not connected by an edge) or S_{uk} = 0 (node u does not belong to the soft cluster k defined by S). Therefore, (AS)_{vk} > 0 if and only if there exists a node u such that v is connected to u and u belongs to the soft cluster k defined by S. ∎

Corollary 6.1.1.

Nodes that are part of a cluster k defined by S are also part of cluster k under the assignment induced by AS.


Proof. This immediately follows from Lemma 6.1 and the fact that each node has a self-loop. ∎

Lemma 6.2.

The adjacency matrix A′ = S^T A S defines a new graph in which the clusters induced by S are connected if and only if there is a common edge between them.


Proof. Let A′ = S^T A S = S^T (AS). Then, A′_{ij} = Σ_v S_{vi} (AS)_{vj} = 0 if and only if, for every v, either S_{vi} = 0 (node v does not belong to cluster i) or (AS)_{vj} = 0 (node v is not connected to any node belonging to cluster j, by Lemma 6.1). Therefore, A′_{ij} > 0 if and only if there exists a node v such that v belongs to cluster i and is connected to a node from cluster j. ∎

This result shows that soft cluster assignment methods connect clusters that have at least one edge between them. We will use this result to show that a Mapper construction obtains an identical graph.
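Lemma 6.2 is easy to check numerically. The toy graph (two disconnected pairs with self-loops) and the hard assignment below are our own example:

```python
import numpy as np

A = np.array([[1., 1., 0., 0.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])   # two disconnected pairs, self-loops
S = np.array([[1., 0.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])           # one cluster per pair

A_pool = S.T @ A @ S               # no inter-cluster edge -> zero off-diagonal

A_bridge = A.copy()
A_bridge[1, 2] = A_bridge[2, 1] = 1.0     # add one edge between the pairs
A_pool_bridged = S.T @ A_bridge @ S       # off-diagonal becomes positive
```

With no edge between the pairs, the pooled adjacency has a zero off-diagonal entry; a single bridging edge makes it positive, which is exactly the "connected iff a common edge exists" behaviour of the lemma.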

Let s* be the 1-hop expansion of the expanded soft cluster assignment s′ of the graph G. Let the soft clusters induced by s* be K_1, …, K_K. Additionally, let U = {U_k} be an open cover of the unit simplex with U_k = {x ∈ Δ_{K−1} : x_k > 0}. Then the pull back cover induced by s* is {s*^{-1}(U_k)}, where s*^{-1}(U_k) = K_k (i.e. all nodes with a non-zero probability of belonging to cluster k).
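The 1-skeleton of the nerve of such a pull back cover connects two clusters exactly when their node sets intersect; a minimal sketch (the toy cover is our own example):

```python
# 1-skeleton of the nerve of a cover: clusters as node sets,
# with an edge between any two clusters that share a node
cover = [{0, 1, 2}, {2, 3}, {4}]
edges = [(i, j)
         for i in range(len(cover))
         for j in range(i + 1, len(cover))
         if cover[i] & cover[j]]    # non-empty intersection
# clusters 0 and 1 overlap at node 2; cluster 2 is isolated
```

This intersection test is precisely the condition analysed in Lemma 6.3 below.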

Lemma 6.3.

Two clusters K_i and K_j induced by s* have a non-empty intersection in the expanded graph G′ if and only if their corresponding clusters in the original graph G have a common edge between them.


Proof. If direction: By Corollary 6.1.1, the case of a self-edge becomes trivial. Now, let v and u be two nodes connected by an external edge in the original graph, belonging to clusters i and j, respectively. Then, in the expanded graph G′, there will be clique nodes v_a and u_b such that (v_a, u_b) ∈ E′. By taking the 1-hop expansion of the expanded cluster assignment, both v_a and u_b will belong to K_i ∩ K_j by Lemma 6.1, since they are in each other's 1-hop neighbourhood. Since we have chosen the clusters and the nodes arbitrarily, this proves this direction.

Only if direction: Let K_i and K_j be the (expanded) clusters in the expanded graph G′ corresponding to clusters i and j in the original graph. Let a node w be part of the non-empty intersection between the soft clusters K_i and K_j defined by s* in the expanded graph G′. By Lemma 6.1, w belongs to K_i if and only if there exists a node x such that (w, x) ∈ E′ and s′_i(x) > 0. Similarly, there must exist a node y such that (w, y) ∈ E′ and s′_j(y) > 0. By the construction of G′, either both x and y are part of the clique w is part of, or one of them is in the clique and the other is outside it.

Suppose, without loss of generality, that x is in the clique and y is outside it. Then s′_i(w) = s′_i(x) > 0, since w and x share the same cluster assignment. By the construction of G′, the edge between the nodes of the original graph corresponding to w and y is an edge between clusters i and j. Similarly, if both x and y are part of the same clique as w, then they all originate from the same node of the original graph. The self-edge of that node is an edge between clusters i and j. ∎

Proposition 6.1.

Let G be a graph with self-loops and adjacency matrix A. The graph defined by A′ = S^T A S and the 1-skeleton of the nerve of the pull back cover induced by s* are isomorphic.


Proof. Based on Lemma 6.3, the 1-skeleton of the nerve connects two soft clusters defined by S if and only if there is a common edge between them. By Lemma 6.2, soft cluster assignment methods connect the soft clusters identically through the adjacency matrix A′ = S^T A S. Therefore, the unweighted graphs determined by the two constructions are isomorphic. ∎

Note that our statement refers strictly to unweighted edges. However, an arbitrary weight function can easily be attached to the graph obtained through the nerve construction.

Appendix B: Structural Deep Graph Mapper

The edges between the nodes in a DGM visualisation denote semantic similarities discovered by the lens. However, even though semantically-related nodes are often neighbours in many real-world graphs, this is not always the case. For example, GNNs have been shown to ignore this assumption, often relying entirely on the graph features (Luzhnica et al., 2019b).

Figure 9: SDGM visualisation of the Spammer dataset. The thickness of the edges is now proportional to the number of edges between clusters; the filtration value ε and the overlap were kept fixed. This visualisation also illustrates that the spammers are densely connected to the other nodes in the graph, while non-spammers form smaller connected groups. However, unlike the DGM visualisation, this graph also shows the (structural) edges between the spammers and the non-spammers.
Figure 10: SDGM visualisation on Cora using a DGI lens and ground-truth labels, with varying values of the filtration constant ε. Lower values of ε increase the connectivity of the graph.

Therefore, it is sometimes desirable that the connectivity of the graph is explicitly accounted for in the visualisations, being involved in more than simply computing the refined pull back cover. Motivated by the proof from Appendix A, we also propose Structural DGM (SDGM), a version of DGM that connects the elements of the refined pull back cover based on the number of edges between the component nodes from the original graph.

SDGM uses the refined pull back cover induced by the 1-hop expansion in the expanded graph (see Appendix A) to compute the nerve. However, a downside of this approach is that the output graph may often be too dense to visualise. Therefore, we use a weighting function w to weight the edges of the resulting graph, together with a filtration value ε. We then filter out all the edges e with w(e) < ε, where w is determined by the normalised weighted adjacency matrix denoting the (soft) number of edges between two clusters. Figure 9 includes an SDGM visualisation for the spammer graph.

The overlap parameter effectively sets a trade-off between the number of structural and semantic connections in the SDGM graph. For an overlap of zero, only structural connections exist. At the same time, the filtration constant ε is an additional parameter that adjusts the resolution of the visualisations. Higher values of ε result in sparser graphs, while lower values increase the connectivity between the nodes. We illustrate the effects of varying ε in Figure 10.
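The edge-filtering step can be sketched as follows; the function name `sdgm_filter`, the max-normalisation of the weights, and the zeroed diagonal are our own simplifying choices, not details from the paper:

```python
import numpy as np

def sdgm_filter(W, eps):
    """Keep only edges whose normalised weight is at least eps."""
    W_norm = W / W.max()                  # one simple normalisation choice
    out = np.where(W_norm >= eps, W_norm, 0.0)
    np.fill_diagonal(out, 0.0)            # drop self-connections for plotting
    return out

# Toy weighted cluster adjacency: cluster pair (0, 2) is weakly connected
W = np.array([[5.0, 4.0, 0.1],
              [4.0, 3.0, 2.0],
              [0.1, 2.0, 1.0]])
W_filtered = sdgm_filter(W, eps=0.3)      # weak (0, 2) edge is removed
```

Raising `eps` removes progressively more of the weak inter-cluster edges, which is how the filtration constant controls the sparsity of the SDGM graph.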

Appendix C: Model Architecture and Hyperparameters

The optimised models described in the Experiments section have the following configurations:

  • DGM—learning rate , hidden sizes and:

    • D&D and Collab: cover sizes , interval overlap , batch size ;

    • Proteins: cover sizes , interval overlap , batch size ;

    • Reddit-Binary: cover sizes , interval overlap , batch size ;

  • Top-k — specific dataset configurations, as provided in the official GitHub repository;

  • minCUT—learning rate , same architecture as reported by the authors in the original work (Bianchi et al., 2019);

  • DiffPool—learning rate , hidden size , two pooling steps, pooling ratio , global average mean readout layer, with the exception of Collab and Reddit-Binary, where the hidden size was .

We additionally performed a hyperparameter search for DiffPool on hidden sizes and for DGM, over the following sets of possible values:

  • all datasets: cover sizes , interval overlap ;

  • D&D: learning rate ;

  • Proteins: learning rate , cover sizes , hidden sizes .


  1. Note that one or more (embedding, pooling) operations may be sequentially performed in the pipeline.


References

  1. The PageRank Axioms. In Dagstuhl Seminar Proceedings.
  2. NCVis: Noise Contrastive Approach for Scalable Visualization. arXiv preprint arXiv:2001.11411.
  3. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
  4. Mincut Pooling in Graph Neural Networks. arXiv preprint arXiv:1907.00481.
  5. PageRank as a Function of the Damping Factor. In Proceedings of the 14th International Conference on World Wide Web, pp. 557–566.
  6. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
  7. Towards Sparse Hierarchical Graph Classifiers. arXiv preprint arXiv:1811.01287.
  8. PersLay: A Neural Network Layer for Persistence Diagrams and New Graph Topological Signatures. stat 1050, pp. 17.
  9. Statistical Analysis and Parameter Selection for Mapper. The Journal of Machine Learning Research 19 (1), pp. 478–516.
  10. An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists. arXiv preprint arXiv:1710.04019.
  11. UC Berkeley CS267, Lecture 20: Partitioning Graphs without Coordinate Information II.
  12. Motif Simplification: Improving Network Visualization Readability with Fan, Connector, and Clique Glyphs. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13), pp. 3247–3256.
  13. Edge Compression Techniques for Visualization of Dense Directed Graphs. IEEE Transactions on Visualization and Computer Graphics 19 (12), pp. 2596–2605.
  14. Persistent Homology — a Survey. Discrete & Computational Geometry.
  15. Algebraic Connectivity of Graphs. Czechoslovak Mathematical Journal 23 (2), pp. 298–305.
  16. Topology of Learning in Artificial Neural Networks.
  17. A look at the topology of convolutional neural networks. arXiv preprint arXiv:1810.03234.
  18. An open graph visualization system and its applications to software engineering. Software: Practice and Experience 30 (11), pp. 1203–1233.
  19. Graph U-Nets. In International Conference on Machine Learning, pp. 2083–2092.
  20. Exploring Network Structure, Dynamics, and Function using NetworkX. Technical report, Los Alamos National Laboratory, Los Alamos, NM.
  21. Mapper on Graphs for Network Visualization. arXiv preprint arXiv:1804.11242.
  22. MOG: Mapper on Graphs for Relationship Preserving Clustering. arXiv preprint arXiv:1804.11242.
  23. Deep Learning with Topological Signatures. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17), pp. 1633–1643.
  24. Graph Filtration Learning. arXiv preprint arXiv:1905.10996.
  25. AttPool: Towards Hierarchical Feature Representation in Graph Convolutional Networks via Attention Mechanism. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6480–6489.
  26. Principal component analysis. Springer Verlag, New York.
  27. Benchmark Data Sets for Graph Kernels.
  28. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
  29. Semi-Supervised Classification with Graph Convolutional Networks. arXiv preprint arXiv:1609.02907.
  30. Self-Attention Graph Pooling. In International Conference on Machine Learning, pp. 3734–3743.
  31. CS224W: Social and Information Network Analysis, Graph Clustering.
  32. Clique pooling for graph classification. arXiv preprint arXiv:1904.00374.
  33. On graph classification networks, datasets and baselines. arXiv preprint arXiv:1905.04682.
  34. Graph Convolutional Networks with EigenPooling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 723–731.
  35. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.
  36. ASAP: Adaptive Structure Aware Pooling for Learning Hierarchical Graph Representations. arXiv preprint arXiv:1911.07979.
  37. The Graph Neural Network Model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
  38. Collective Classification in Network Data. AI Magazine 29 (3), pp. 93–93.
  39. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. The Eurographics Association.
  40. NodeXL: a free and open network overview, discovery and exploration add-in for Excel 2007/2010.
  41. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
  42. Deep Graph Infomax. arXiv preprint arXiv:1809.10341.
  43. GNNExplainer: Generating Explanations for Graph Neural Networks. In Advances in Neural Information Processing Systems, pp. 9240–9251.
  44. Hierarchical Graph Representation Learning with Differentiable Pooling. In Advances in Neural Information Processing Systems, pp. 4800–4810.