Deep Graph Mapper
Abstract
Recent advancements in graph representation learning have led to the emergence of condensed encodings that capture the main properties of a graph. However, even though these abstract representations are powerful for downstream tasks, they are not equally suitable for visualisation purposes. In this work, we merge Mapper, an algorithm from the field of Topological Data Analysis (TDA), with the expressive power of Graph Neural Networks (GNNs) to produce hierarchical, topologically-grounded visualisations of graphs. These visualisations not only help discern the structure of complex graphs but also provide a means of understanding the models applied to them for solving various tasks. We further demonstrate the suitability of Mapper as a topological framework for graph pooling by mathematically proving an equivalence with minCUT and DiffPool. Building upon this framework, we introduce a novel pooling algorithm based on PageRank, which obtains competitive results with state-of-the-art methods on graph classification benchmarks.
1 Introduction
Tasks involving graph-structured data have received much attention lately, due to the abundance of relational information in the real world. Considerable progress has been made in the field of graph representation learning through generating graph encodings with the help of deep learning techniques. The abstract representations obtained by these models are typically intended for further processing within downstream tasks. However, few of these advancements have been directed towards visualising and aiding the human understanding of the complex networks ingested by machine learning models. We believe that data and model visualisation are important steps of the statistical modelling process and deserve an increased level of attention.
Here, we tackle this problem by merging Mapper (Singh et al., 2007), an algorithm from the field of Topological Data Analysis (TDA) (Chazal and Michel, 2017), with the demonstrated representational power of Graph Neural Networks (GNNs) (Scarselli et al., 2008; Battaglia et al., 2018; Bronstein et al., 2017) and refer to this synthesis as Deep Graph Mapper (DGM). Our method offers a means to visualise graphs and the complex data living on them through a GNN ‘lens’. Moreover, the aspects highlighted by the visualisation can be flexibly adjusted via the loss function of the network. Finally, DGM achieves progress with respect to GNN interpretability, providing a way to visualise the model and identify the mistakes it makes for node-level supervised and unsupervised learning tasks.
We then demonstrate that Mapper graph summaries are not only suitable for visualisation, but can also constitute a pooling mechanism within GNNs. We begin by proving that Mapper is a generalisation of pooling methods based on soft cluster assignments, which include state-of-the-art algorithms like minCUT (Bianchi et al., 2019) and DiffPool (Ying et al., 2018). Building upon this topological perspective, we propose MPR, a novel graph pooling algorithm based on PageRank (Page et al., 1999). Our method obtains competitive or superior results when compared with state-of-the-art pooling methods on graph classification benchmarks. To summarise, our contributions are threefold:

- DGM, a topologically-grounded method for visualising graphs and the GNNs applied to them.
- A proof that Mapper is a generalisation of soft cluster assignment pooling methods, including the state-of-the-art minCUT and DiffPool.
- MPR, a Mapper-based pooling method that achieves similar or superior results compared to state-of-the-art methods on several graph classification benchmarks.
2 Related Work
2.1 Graph Visualisation and Interpretability
A number of software tools exist for visualising node-link diagrams: NetworkX (Hagberg et al., 2008), Gephi (Bastian et al., 2009), Graphviz (Gansner and North, 2000) and NodeXL (Smith et al., 2010). However, these tools do not attempt to produce condensed summaries and consequently suffer on large graphs from the visual clutter problem illustrated in Figure 1. This makes the interpretation and understanding of the graph difficult. Some of the earlier attempts to produce visual summaries rely on grouping nodes into a set of predefined motifs (Dunne and Shneiderman, 2013) or compressing them in a lossless manner into modules (Dwyer et al., 2013). However, these mechanisms are severely constrained by the simple types of node groupings that they allow.
Mapper-based summaries for graphs have recently been considered by Hajij et al. (2018a). However, despite the advantages provided by Mapper, their approach relies on hand-crafted graph-theoretic ‘lenses’, such as the average geodesic distance, graph density functions or eigenvectors of the graph Laplacian. Not only are these methods rigid and unable to adapt well to the graph or task of interest, but they are also computationally inefficient. Moreover, they do not take into account the features of the graph. In this paper, we build upon their work by considering learnable functions (GNNs) that do not present these problems.
Mapper visualisations are also an indirect way to analyse the behaviour of the associated ‘lens’ function. However, visualisations that are directly oriented towards model interpretability have been recently considered by Ying et al. (2019), who propose a model capable of indicating the relevant subgraphs and features for a given model prediction.
2.2 Graph Pooling
Pooling algorithms have already been considerably explored within GNN frameworks for graph classification. Luzhnica et al. (2019) propose a topological approach to pooling which coarsens the graph by aggregating its maximal cliques into new clusters. However, cliques are local topological features, whereas our MPR algorithm leverages a global perspective of the graph during pooling. Two paradigms distinguish themselves among learnable pooling layers: top-k pooling based on a learnable ranking, initially adopted by Gao and Ji (2019) (Graph U-Nets), and learning the cluster assignment (Ying et al., 2018) with additional entropy and link prediction losses for more stable training (DiffPool). Following these two trends, several variants and incremental improvements have been proposed. The top-k approach is explored in conjunction with jumping-knowledge networks (Cangea et al., 2018), attention (Lee et al., 2019; Huang et al., 2019) and self-attention for cluster assignment (Ranjan et al., 2019). Similarly to DiffPool, the method suggested by Bianchi et al. (2019) uses several loss terms to enforce clusters with strongly connected nodes, similar sizes and orthogonal assignments. A different approach is also proposed by Ma et al. (2019), who leverage spectral clustering for pooling.
2.3 Topological Data Analysis in Machine Learning
Persistent homology (Edelsbrunner and Harer, 2008) has so far been the most popular branch of TDA applied to machine learning, and to graphs especially. Hofer et al. (2017) and Carriere et al. (2019) integrated graph persistence diagrams with neural networks to obtain topology-aware models. A more advanced approach for GNNs has been proposed by Hofer et al. (2019), who backpropagate through the persistent homology computation to directly learn a graph filtration.
Mapper, another central algorithm in TDA, has been used in deep learning almost exclusively as a tool for understanding neural networks. Gabella et al. (2019) use Mapper to visualise the evolution of the weights of fully connected networks during training, while Gabrielsson and Carlsson (2018) use it to visualise the filters computed by CNNs. To the best of our knowledge, the paper of Hajij et al. (2018a) remains the only application of Mapper on graphs.
3 Mapper for Visualisations
In this section we describe the proposed integration between Mapper and GCNs.
3.1 Mapper on Graphs
We start by reviewing Mapper (Singh et al., 2007), a topologically-motivated algorithm for high-dimensional data visualisation. Intuitively, Mapper obtains a low-dimensional image of the data that can be easily visualised. The algorithm produces an output graph that shows how clusters within the data are semantically related from the perspective of a ‘lens’ function. The resulting graph preserves the notion of ‘nearness’ of the input topological space, but it can compress large-scale distances by connecting far-away points that are similar according to the lens. We first introduce the required mathematical background.
Definition 3.1.
An open cover of a topological space $X$ is a collection of open sets $(U_i)_{i \in I}$, for some indexing set $I$, whose union includes $X$.
For example, an open cover for the real numbers could be $\{(-\infty, 1), (0, +\infty)\}$. Similarly, $\{\{v_1, v_2\}, \{v_2, v_3\}\}$ is an open cover for a set of vertices $\{v_1, v_2, v_3\}$.
Definition 3.2.
Let $X$ be a topological space, $f \colon X \to \mathbb{R}^d$ a continuous function, and $\mathcal{U} = (U_i)_{i \in I}$ a cover of $\mathbb{R}^d$. Then, the pull back cover $f^*(\mathcal{U})$ of $X$ induced by $(f, \mathcal{U})$ is the collection of open sets $f^{-1}(U_i)$, $i \in I$, where by $f^{-1}(U_i)$ we denote the preimage of the set $U_i$.
Given a dataset $X$, a carefully chosen lens function $f \colon X \to \mathbb{R}^d$ and a cover $\mathcal{U}$ of the image of $f$, Mapper first computes the associated pull back cover $f^*(\mathcal{U})$. Then, using a clustering algorithm of choice, it clusters each of the open sets $f^{-1}(U_i)$ in $f^*(\mathcal{U})$. The resulting collection of sets is called the refined pull back cover, denoted by $(C_j)_{j \in J}$ with indexing set $J$. Concretely, in this paper, the input dataset is a weighted graph $G = (V, E)$ and $f$ is a function over the vertices of the graph. An illustration of these steps is provided in Figure 2 (a-e) for a ‘height’ function $f$.
Finally, Mapper produces a new graph by taking the 1-skeleton of the nerve of the refined pull back cover: a graph where the vertices are given by $(v_j)_{j \in J}$ and two vertices $v_{j_1}, v_{j_2}$ are connected if and only if $C_{j_1} \cap C_{j_2} \neq \emptyset$. Informally, the soft clusters formed by the refined pull back become the new nodes, and clusters with common nodes are connected by an edge. This final step is illustrated in Figure 2 (f).
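For concreteness, the procedure above can be sketched in a few lines of Python. This is an illustrative sketch (not the authors' implementation), assuming a scalar lens given as a dictionary over the nodes, an interval cover, and connected components of the induced subgraphs as the clustering algorithm:

```python
import networkx as nx

def graph_mapper(G, lens, intervals):
    """Sketch of Mapper on a graph: pull back an interval cover through
    the lens, cluster each preimage into connected components, and build
    the 1-skeleton of the nerve. `lens` maps each node to a real number;
    `intervals` is a list of (lo, hi) overlapping intervals covering the
    image of the lens."""
    clusters = []  # each refined-pull-back cluster is a frozenset of nodes
    for lo, hi in intervals:
        preimage = [v for v in G if lo < lens[v] < hi]       # pull back cover
        subgraph = G.subgraph(preimage)
        clusters += [frozenset(c) for c in nx.connected_components(subgraph)]
    # nerve: clusters become nodes, overlapping clusters share an edge
    M = nx.Graph()
    M.add_nodes_from(range(len(clusters)))
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if clusters[i] & clusters[j]:
                M.add_edge(i, j)
    return M, clusters
```

On a path graph with the (scaled) node index as the lens and two overlapping intervals, this produces a two-node summary graph joined by a single edge, since the two preimages share nodes.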
Three main degrees of freedom that determine the visualisation can be distinguished within Mapper:
The lens $f$: In our case, the lens is a function over the vertices, which acts as a filter that emphasises certain features of the graph. The choice of $f$ highly depends on the properties to be highlighted by the visualisation. We also refer to the codomain of $f$ as the parametrisation space.
The cover $\mathcal{U}$: The choice of cover determines the resolution of the summary. Fine-grained covers will produce more detailed visualisations, while higher overlaps between the sets in the cover will increase the connectivity of the output graph. When $d = 1$, a common choice is to select a set of overlapping intervals over the real line, as in Figure 2 (c).
Clustering algorithm: Mapper works with any clustering algorithm. When the dataset is a graph, a natural choice adopted by Hajij et al. (2018a) is to take the connected components of the subgraphs induced by the vertices in each open set $f^{-1}(U_i)$ (Figure 2 (e-f)). This is also the approach we follow, unless otherwise stated.
3.2 Seeing through the Lens of GCNs
As mentioned in Section 2.1, Hajij et al. (2018b) have considered a set of graph-theoretic functions for the lens. However, with the exception of PageRank, all of these functions are difficult to compute on large graphs. The average geodesic distance and the graph density function require computing the distance matrix of the graph, while the eigenfunction computations do not scale beyond graphs with a few thousand nodes. Besides their computational complexity, many real-world graphs contain features within their nodes and edges, which are ignored by graph-theoretic summaries.
In this work, we leverage the recent progress in the field of graph representation learning and propose a series of lens functions based on Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016). We refer to this integration between Mapper and GCNs as Deep Graph Mapper (DGM).
Unlike graph-theoretic functions, GCNs can naturally learn and integrate the features associated with the graph and its topological properties, while also scaling to large, complex graphs. Additionally, visualisations can flexibly be tuned for the task of interest by adjusting the associated loss function.
Mapper further constitutes a method for implicitly visualising the lens function, so DGM is also a novel approach to model understanding. Figure 4 illustrates how Mapper can be used to identify mistakes that the underlying GCN model makes in node classification. This showcases the potential of DGM for continuous model refinement.
3.3 Supervised Lens
A natural application for DGM in the supervised domain is as an assistive tool for binary node classification. The node classifier, a function $f \colon V \to [0, 1]$, is an immediate candidate for the DGM lens. The usual choice for a cover of $[0, 1]$ is a set of $n$ equally-sized overlapping intervals, with an overlap percentage of $g$. This often produces a hierarchical (tree-like) perspective over the input graph, as shown in Figure 1.
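A cover of this form is straightforward to construct. The sketch below assumes one common convention, in which each of the $n$ equal-length intervals is enlarged symmetrically so that consecutive intervals overlap by a fraction $g$ of an interval's length; the text does not pin down the exact convention, so treat this as one possible choice:

```python
def interval_cover(n, g, lo=0.0, hi=1.0):
    """n equally-sized intervals covering [lo, hi], each padded so that
    consecutive intervals overlap by a fraction g of their base length."""
    length = (hi - lo) / n
    pad = g * length / 2
    return [(lo + i * length - pad, lo + (i + 1) * length + pad)
            for i in range(n)]
```

Increasing $n$ refines the resolution of the summary, while increasing $g$ widens the pairwise overlaps and hence the connectivity of the output graph.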
However, most node-classification tasks involve more than two labels. A simple solution in this case is to use a dimensionality reduction algorithm such as t-SNE (van der Maaten and Hinton, 2008) to embed the logits in $\mathbb{R}^d$, where $d$ is small. Empirically, when the number of classes is larger than two, we find a 2D parametrisation space to better capture the relationships between the classes in the semantic space. Figure 3 includes a visualisation of the Cora dataset using a t-SNE embedding of the logits. Faster dimensionality reduction methods such as PCA (Jolliffe, 2002) or the recently proposed NCVis (Artemenkov and Panov, 2020) could be used to scale this approach to large graphs.
In a supervised setting, DGM visualisations can also be integrated with the ground-truth labels of the nodes. The latter provide a means for visualising both the mistakes that the classifier is making and the relationships between classes. For a lens trained to perform binary classification of the nodes, we colour each node of the summary graph proportionally to the number of nodes in the original graph that are part of that cluster and belong to the positive class. For lenses that are multi-label classifiers, we colour each node with the most frequent class in its corresponding cluster. Figure 4 gives the labelled summary for the binary Spam dataset, while Figure 7 includes two labelled visualisations for the Cora and CiteSeer datasets.
3.4 Unsupervised Lens
The expressive power of GNNs is not limited to supervised tasks. In fact, many graphs do not have any labels associated with their nodes; in this case, the lenses described so far, which require supervised training, cannot be applied. However, recent models for unsupervised graph representation learning constitute a powerful alternative.
Here, we use Deep Graph Infomax (DGI) (Veličković et al., 2018) to compute node embeddings and obtain a low-dimensional parametrisation of the graph nodes. DGI computes node-level embeddings by learning to maximise the mutual information (MI) between patch representations and corresponding high-level summaries of graphs. We have empirically found that applying t-SNE over a higher-dimensional embedding of DGI works better than learning a low-dimensional parametrisation with DGI directly. Figure 7 includes two labelled visualisations obtained with the unsupervised DGI lens on Cora and CiteSeer.
3.5 Hierarchical Visualisations
One of the biggest advantages of Mapper is the flexibility to adjust the resolution of the visualisation by modifying the cover $\mathcal{U}$. The number of sets in $\mathcal{U}$ determines the coarseness of the output graph, with a larger number of sets generally producing a larger number of nodes. The overlap between these sets determines the level of connectivity between the nodes; for example, a higher degree of overlap produces denser graphs. By adjusting these two parameters, one can discover which properties of the graph are persistent, consequently surviving at multiple scales, and which features are mainly due to noise. In Figure 5, we include multiple visualisations of the CiteSeer (Sen et al., 2008) dataset at various scales, determined by adjusting the size $n$ of the cover and its interval overlap percentage $g$.
3.6 The Dimensionality of the Output Space
Although Mapper typically uses a one-dimensional parametrisation space, higher-dimensional ones have extra degrees of freedom in embedding the nodes inside them. Therefore, higher-dimensional spaces feature more complex neighbourhoods or relations of semantic proximity, as can be seen from Figures 1 (1D), 3 (2D) and 6 (3D).
However, the interpretability of the lens decreases as the dimension $d$ increases, for a number of reasons. First, colormaps can be easily visualised in 1D and 2D, but it is hard to visualise a 3D colormap. Therefore, for the 3D parametrisation in Figure 6, we use dataset labels to colour the nodes. Second, the curse of dimensionality becomes a problem: the open sets of the cover are likely to contain fewer and fewer nodes as $d$ increases. Therefore, the resolution of the visualisation can hardly be adjusted in higher-dimensional spaces.
4 Mapper for Pooling
We now present a theoretical result showing that the graph summaries computed by Mapper are not only useful for visualisation purposes, but also as a pooling (graph-coarsening) mechanism inside GNNs. Building on this evidence, we introduce a pooling method based on PageRank and Mapper.
4.1 Mapper and Spectral Clustering
The relationship between Mapper for graphs and spectral clustering has been observed by Hajij et al. (2018a). This link is a strong indicator that Mapper can compute ‘useful’ clusters for pooling. We formally restate this observation below and provide a short proof.
Proposition 4.1.
Let $L$ be the Laplacian of a graph $G = (V, E)$ and $l_2$ the eigenvector corresponding to the second lowest eigenvalue of $L$, also known as the Fiedler vector (Fiedler, 1973). Then, for a function $f \colon V \to \mathbb{R}$, $f(v) = l_2(v)$, outputting the entry in the eigenvector corresponding to node $v$, and a cover $\mathcal{U} = \{(-\infty, \epsilon), (-\epsilon, +\infty)\}$, Mapper produces a spectral bipartition of the graph for a sufficiently small positive $\epsilon$.
Proof.
It is well known that the Fiedler vector can be used to obtain a “good” bipartition of the graph based on the signature of the entries of the vector (i.e. $l_2(v) \geq 0$ and $l_2(v) < 0$); please refer to Demmel (1995) for a proof. Therefore, by setting $\epsilon$ to a sufficiently small positive number, the obtained pull back cover is a spectral bipartition of the graph. ∎
The result above indicates that Mapper is a generalisation of spectral clustering. As the latter is strongly related to min-cuts (Leskovec, 2016), the proposition also provides a link between Mapper and min-cuts.
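The spectral bipartition in Proposition 4.1 is easy to reproduce numerically. The following sketch (an illustration, not the paper's code) computes the Fiedler vector of the unnormalised Laplacian and splits the nodes by sign, which is exactly the partition Mapper recovers with the lens $l_2$ and two intervals overlapping only near zero:

```python
import numpy as np

def fiedler_bipartition(A):
    """Split a graph into two parts by the sign of the Fiedler vector of
    its unnormalised Laplacian L = D - A. Assumes A is a symmetric
    (dense) adjacency matrix of a connected graph."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    l2 = eigvecs[:, 1]                     # Fiedler vector
    return np.where(l2 >= 0)[0], np.where(l2 < 0)[0]
```

On two triangles joined by a single bridge edge, the sign split recovers the two triangles, i.e. the min-cut-style bipartition discussed above (up to the arbitrary sign of the eigenvector).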
4.2 Mapper and Soft Cluster Assignments
Let $G = (V, E)$ be a graph with self-loops for each node and adjacency matrix $A$. Soft cluster assignment pooling methods use a soft cluster assignment matrix $S \in \mathbb{R}^{N \times K}$, where $N$ is the number of nodes in the graph and $K$ is the number of clusters, and compute the new adjacency matrix of the graph via $A' = S^T A S$. Equivalently, two soft clusters become connected if and only if there is a common edge between them.
Proposition 4.2.
There exists a graph $G'$ derived from $G$, a soft cluster assignment $s'$ of $G'$ based on $S$, and a cover $\mathcal{U}$ of $\Delta_{K-1}$, the $(K-1)$-dimensional unit simplex, such that the 1-skeleton of the nerve of the pull back cover induced by $(s', \mathcal{U})$ is isomorphic with the graph defined by $S^T A S$.
This shows that Mapper is a generalisation of soft cluster assignment methods. A detailed proof and a diagrammatic version of it can be found in the supplementary material. Note that this result implicitly uses an instance of Mapper with the simplest possible clustering algorithm, which assigns all vertices in each open set of the pull back cover to the same cluster (i.e. no clustering).
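A small numeric example makes the coarsening rule concrete. With a path graph (self-loops included) and a soft assignment in which one node shares mass between the two clusters, the pooled adjacency $S^T A S$ acquires a non-zero off-diagonal entry precisely because an edge crosses the two clusters; the values below are illustrative, not taken from the paper:

```python
import numpy as np

# Path 0-1-2-3 with self-loops on every node.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

# Soft assignment over K = 2 clusters: node 1 is shared between both.
S = np.array([[1.0, 0.0],
              [0.7, 0.3],
              [0.0, 1.0],
              [0.0, 1.0]])

# Coarsened adjacency: clusters connect iff some edge links their members.
A_pooled = S.T @ A @ S
```

The off-diagonal entry of `A_pooled` is positive because of the edges incident to the shared node; setting node 1's assignment to a hard one-hot vector with no cross-cluster edges would zero it out.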
We hope that this result will enable theoreticians to study pooling operators through the topological and statistical properties of Mapper (Carriere et al., 2018). At the same time, we encourage practitioners to take advantage of it and design new pooling methods in terms of a wellchosen lens function and cover for its image.
A remaining challenge is designing differentiable pull back cover operations to automatically learn a parametric lens and cover. We leave this exciting direction for future work and focus on exploiting the proven connection—we propose a simple, non-differentiable pooling method based on PageRank and Mapper that performs surprisingly well.
4.3 PageRank Pooling
Model
For the graph classification task, each example $G$ is represented by a tuple $(X, A)$, where $X$ is the node feature matrix and $A$ is the adjacency matrix. Both our graph embedding and classification networks consist of a sequence of graph convolutional layers (Kipf and Welling, 2016); the $l$-th layer operates on its input feature matrix $X_l$ as follows:
(1)  $X_{l+1} = \sigma\big(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X_l W_l\big),$
where $\hat{A} = A + I$ is the adjacency matrix with self-loops, $\hat{D}$ is the diagonal degree matrix of $\hat{A}$ used for normalisation, $W_l$ is a learnable weight matrix and $\sigma$ is the activation function.
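Equation (1) can be written directly in NumPy. This is a minimal dense sketch of the layer (real implementations use sparse operations), with the activation left generic:

```python
import numpy as np

def gcn_layer(X, A, W, act=np.tanh):
    """One GCN layer: act(D^-1/2 (A + I) D^-1/2 X W). `act` stands in for
    the activation sigma; tanh here is an arbitrary illustrative choice."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return act(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)
```

Each node's output is thus an activation of a degree-normalised average over its 1-hop neighbourhood (itself included), projected by $W_l$.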
After $L$ layers, the embedding network simply outputs node features $X_L$, which are subsequently processed by a pooling layer to coarsen the graph. The classification network first takes as input the node features of the Mapper-pooled (Section 4.3.2) graph,

(2)  $X_{MG} \in \mathbb{R}^{n \times d},$

where $n$ is the number of nodes in the final graph and $d$ is the node feature dimension.
Mapper-based PageRank (MPR) Pooling
We now describe the pooling mechanism used in our graph classification pipeline, which we adapt from the Mapper algorithm. The first step is to assign each node a real number in $[0, 1]$, achieved by computing a lens function $f \colon V \to [0, 1]$ given by the normalised PageRank (PR) (Page et al., 1999) of the nodes. The PageRank function assigns an importance value to each of the nodes based on their connectivity, according to the following recurrence relation:
(3)  $\mathrm{PR}_i = \sum_{j \in N(i)} \frac{\mathrm{PR}_j}{|N(j)|},$
where $N(i)$ represents the set of neighbours of the $i$-th node in the graph. The resulting scores are values in $[0, 1]$ which reflect the probability of a random walk through the graph ending in a given node. The $\mathrm{PR}$ vector is computed using NetworkX (Hagberg et al., 2008) via power iteration, as $\mathrm{PR}$ is the principal eigenvector of the transition matrix $M$ of the graph:
(4)  $M = d\,\tilde{A} + \frac{1 - d}{N} E,$
where $E$ is a matrix with all elements equal to 1, $\tilde{A}$ is the column-normalised adjacency matrix and $d \in [0, 1]$ is the probability of continuing the random walk at each step; a value closer to 0 implies the nodes would receive a more uniform ranking and tend to be clustered in a single node. We choose the widely-adopted $d = 0.85$ and refer the reader to Boldi et al. (2005) for more details.
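The power iteration on eq. (4) can be sketched as follows. This is an illustration rather than the NetworkX implementation used by the authors; it assumes a column-normalised transition matrix (so the graph has no isolated nodes) and the damping convention stated above:

```python
import numpy as np

def pagerank(A, d=0.85, tol=1e-10):
    """PageRank by power iteration on M = d * A_norm + (1 - d)/N * E.
    A is a (dense) adjacency matrix with no zero columns."""
    N = A.shape[0]
    A_norm = A / A.sum(axis=0, keepdims=True)   # column-stochastic transitions
    M = d * A_norm + (1 - d) / N * np.ones((N, N))
    pr = np.full(N, 1.0 / N)                    # uniform initial distribution
    while True:
        nxt = M @ pr
        if np.abs(nxt - pr).sum() < tol:
            return nxt
        pr = nxt
```

On a star graph, the hub accumulates the highest score while the leaves receive equal, lower scores, matching the intuition that MPR's lens orders nodes by connectivity.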
We use the previously described cover of overlapping intervals over $[0, 1]$ and determine the pull back cover induced by $(f, \mathcal{U})$. This effectively builds a soft cluster assignment matrix $S$ from nodes in the original graph to ones in the pooled graph:
(5)  $S_{ij} = \frac{\mathbf{1}[f(v_i) \in U_j]}{\sum_{k} \mathbf{1}[f(v_i) \in U_k]},$
where $U_j$ is the $j$-th overlapping interval in the cover $\mathcal{U}$ of $[0, 1]$. It can be observed that the resulting clusters contain nodes with similar PageRank scores, as determined by eq. 3. Therefore, our pooling method intuitively merges the (usually few) highly connected nodes in the graph, and at the same time clusters the (typically many) dangling nodes whose normalised PageRank score is closer to zero.
Finally, the mapping $S$ is used to compute features for the new nodes (i.e. the soft clusters formed by the pull back), $X' = S^T X$, and the corresponding adjacency matrix, $A' = S^T A S$.
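Putting eq. (5) and the pooling step together, a minimal sketch of MPR pooling (assuming PageRank scores already normalised to $[0, 1]$ and an explicit list of cover intervals) is:

```python
import numpy as np

def mpr_pool(X, A, pr, intervals):
    """Sketch of MPR pooling: build the soft assignment S from interval
    membership of normalised PageRank scores `pr` (eq. 5), then pool
    features and adjacency as X' = S^T X, A' = S^T A S."""
    N, K = len(pr), len(intervals)
    S = np.zeros((N, K))
    for j, (lo, hi) in enumerate(intervals):
        S[:, j] = (pr > lo) & (pr < hi)        # indicator of membership in U_j
    S /= S.sum(axis=1, keepdims=True)          # row-normalise over intervals hit
    return S.T @ X, S.T @ A @ S
```

A node whose score falls in the overlap of two consecutive intervals contributes half of its features to each of the two pooled nodes, exactly as prescribed by the normalised indicator in eq. (5).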
It is important that graph classification models are node permutation invariant, since one graph can be represented by any tuple $(PX, PAP^T)$, where $P$ is a node permutation matrix. Below, we state a positive result in this regard for the MPR pooling procedure.
Proposition 4.3.
The PageRank pooling operator defined above is permutation-invariant.
Proof.
First, we note that the PageRank function is permutation-invariant and refer the reader to Altman (2005, Axiom 3.1) for the proof. It then follows trivially that the PageRank pooling operator is also permutation-invariant. ∎
5 Experiments
We now provide a qualitative evaluation of the DGM visualisations and benchmark MPR pooling on a set of graph classification tasks.
5.1 Tasks
The DGM visualiser is evaluated on two popular citation networks: CiteSeer and Cora (Sen et al., 2008). We further showcase the applicability of DGM within a pooling framework, reporting its performance on a variety of settings: social (Reddit-Binary), collaboration (Collab) and chemical data (D&D, Proteins) (Kersting et al., 2016).
5.2 Qualitative Results
In this section, we qualitatively compare DGM using a DGI lens against a Mapper instance that uses a fine-tuned graph density function based on an RBF kernel over the distance matrix of the graph (Figure 7). For reference, we also include a full-graph visualisation using a Graphviz layout and a t-SNE plot of the DGI embeddings that are used as the lens.
5.3 Pooling Evaluation
We adopt a 10-fold cross-validation approach to evaluating the graph classification performance of MPR and other competitive state-of-the-art methods. The same random seed was used for all experiments (with respect to dataset splitting, shuffling and parameter initialisation), in order to ensure a fair comparison across architectures. All models were trained on a Titan Xp GPU, using the Adam optimiser (Kingma and Ba, 2014) with early stopping on the validation set, for a maximum of 30 epochs. We report the classification accuracy using 95% confidence intervals calculated for a population size of 10 (the number of folds).
Models
We compare the performance of MPR to the two other pooling approaches that we identify mathematical connections with: minCUT (Bianchi et al., 2019) and DiffPool (Ying et al., 2018). Additionally, we include Graph U-Net (Gao and Ji, 2019) in our evaluation, as it has been shown to yield competitive results while performing pooling from the alternative perspective of a learnable node ranking; we denote this approach by Top-K in the remainder of this section.
We optimise MPR with respect to its cover cardinality $n$, interval overlap percentage $g$ at each pooling layer, learning rate and hidden size. The Top-K architecture is evaluated using the code provided in the official repository. The minCUT model uses message-passing layers of the form

(6)  $X_{l+1} = \mathrm{ReLU}\big(\tilde{A} X_l W_m + X_l W_s\big),$

where $\tilde{A}$ is the symmetrically normalised adjacency matrix and $W_m, W_s$ are learnable weight matrices representing the message passing and skip-connection operations within the layer. The DiffPool model follows the same sequence of steps. Full details of the model architectures and hyperparameters can be found in the supplementary material.
Results
The graph classification performance obtained by these models is reported in Table 1. By suitably adjusting the MPR hyperparameters, we achieve the best results for D&D, Proteins and Collab, and closely follow minCUT on Reddit-Binary. These results showcase the utility of Mapper for designing better pooling operators.
Dataset         MPR    Top-K    minCUT    DiffPool
D&D
Proteins
Collab
Reddit-Binary
6 Conclusion and Future Work
We have introduced Deep Graph Mapper, a topologically-grounded method for producing informative hierarchical graph visualisations with the help of GNNs. We have shown these visualisations are not only helpful for understanding various graph properties, but can also aid in refining graph models. Additionally, we have proved that Mapper is a generalisation of soft cluster assignment methods, effectively providing a bridge between graph pooling and the TDA literature. Based on this connection, we have proposed a simple Mapper-based PageRank pooling operator, competitive with several state-of-the-art methods on graph classification benchmarks. Future work will focus on backpropagating through the pull back computation to automatically learn a lens and cover. Lastly, we plan to extend our methods to spatio-temporally evolving graphs.
Acknowledgement
We would like to thank Petar Veličković, Ben Day, Felix Opolka, Simeon Spasov, Alessandro Di Stefano, Duo Wang, Jacob Deasy, Ramon Viñas, Alex Dumitru and Teodora Reu for their constructive comments. We are also grateful to Teo Stoleru for helping with the diagrams.
Appendix A: Proof of Proposition 4.2
Throughout the proof we consider a graph $G = (V, E)$ with self-loops for each node. The self-loop assumption is not necessary, but it elegantly handles a number of degenerate cases involving nodes isolated from the other nodes in their cluster. We refer to the edges of a node which are not self-loops as external edges.
Let $s \colon V \to \Delta_{K-1}$ be a soft cluster assignment function that maps the vertices to the $(K-1)$-dimensional unit simplex. We denote by $s_k(v)$ the probability that vertex $v$ belongs to cluster $k$, with $\sum_k s_k(v) = 1$. This function can be completely specified by a cluster assignment matrix $S$ with $S_{ik} = s_k(v_i)$. This is the soft cluster assignment matrix computed by minCUT and DiffPool via a GCN.
Definition 6.1.
Let $G = (V, E)$ be a graph with self-loops for each node. The expanded graph $G'$ of $G$ is the graph constructed as follows. For each node $v$ with at least one external edge, there is a clique of as many nodes as $v$ has external edges. For each external edge $(v, u)$, a pair of nodes from their corresponding cliques, neither having any edges outside its clique, are connected. Additionally, each isolated node becomes, in the new graph, two nodes connected by an edge.
Essentially, the connections between the nodes in the original graph are replaced by the connections between the newly formed cliques such that every new node is connected to at most one node outside its clique. An example of an expanded graph is shown in Figure 8 (b).
Definition 6.2.
Let $G$ be a graph with self-loops for each node, $s$ a soft cluster assignment function for it, and $G'$ the expanded graph of $G$. Then the expanded soft cluster assignment $s'$ is a cluster assignment function for $G'$, where $s'(v') = s(v)$ for all the nodes $v'$ in the corresponding clique of $v$.
In plain terms, all the nodes in the clique inherit through $s'$ the cluster assignment of the corresponding node from the original graph. This is also illustrated by the coloured contours of the expanded graph in Figure 8 (b).
Definition 6.3.
Let $S$ be a soft cluster assignment matrix for a graph $G$ with adjacency matrix $A$. The 1-hop expansion of assignment $S$ with respect to graph $G$ is the new cluster assignment function $s^*$ induced by the row-normalised version of the cluster assignment matrix $S^* = AS$.
As we shall now prove, the 1-hop expansion simply extends each soft cluster from $S$ by adding its 1-hop neighbourhood, as in Figure 8 (c).
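A small example illustrates the definition. On a path graph with self-loops, row-normalising $AS$ spreads each node's membership over the clusters of its 1-hop neighbourhood (the assignment values are illustrative):

```python
import numpy as np

# Path 0-1-2 with self-loops.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)

# Hard assignment: node 0 in cluster 0, nodes 1 and 2 in cluster 1.
S = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

# 1-hop expansion: row-normalised A @ S (Definition 6.3).
S_star = A @ S
S_star /= S_star.sum(axis=1, keepdims=True)
```

Node 0 now holds partial membership in cluster 1, because its neighbour belongs to that cluster, in line with Lemma 6.1.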
Lemma 6.1.
$(AS)_{ik} \neq 0$ if and only if node $v_i$ is connected to a node $v_j$ (possibly $v_i$ itself) which is part of the $k$-th soft cluster of the assignment induced by $S$.
Proof.
By definition, $(AS)_{ik} = 0$ if and only if $\sum_j A_{ij} S_{jk} = 0$. This can happen if and only if, for all $j$, either $A_{ij} = 0$ (nodes $v_i$ and $v_j$ are not connected by an edge) or $S_{jk} = 0$ (node $v_j$ does not belong to soft cluster $k$). Therefore, $(AS)_{ik} \neq 0$ if and only if there exists a node $v_j$ such that $v_i$ is connected to $v_j$ and $v_j$ belongs to soft cluster $k$. ∎
Corollary 6.1.1.
Nodes that are part of a cluster $k$ defined by $S$ are also part of cluster $k$ under the assignment induced by $AS$.
Proof.
This immediately follows from Lemma 6.1 and the fact that each node has a self-loop. ∎
Let $s^*$ be the 1-hop expansion of the expanded soft cluster assignment $s'$ of graph $G'$. Let the soft clusters induced by $s^*$ be $(C_k)_{k \in \{1, \ldots, K\}}$. Additionally, let $\mathcal{U} = (U_k)_{k \in \{1, \ldots, K\}}$ be an open cover of $\Delta_{K-1}$ with $U_k = \{x \in \Delta_{K-1} : x_k > 0\}$. Then the pull back cover induced by $(s^*, \mathcal{U})$ is $(C_k)$, since $(s^*)^{-1}(U_k) = C_k$ (i.e. all nodes with a non-zero probability of belonging to cluster $k$).
Lemma 6.2.
Two clusters $C_x$ and $C_y$ have a non-empty intersection in the expanded graph $G'$ if and only if their corresponding clusters $x$, $y$ in the original graph $G$ have a common edge between them.
Proof.
If direction: By Corollary 6.1.1, the case of a self-edge becomes trivial. Now, let $v$ and $u$ be two nodes connected by an external edge in the original graph, with $s_x(v) > 0$ and $s_y(u) > 0$. Then in the expanded graph $G'$, there will be clique nodes $v'$ and $u'$ such that $(v', u') \in E'$. By taking the 1-hop expansion of the extended cluster assignment, both $v'$ and $u'$ will belong to $C_x \cap C_y$ by Lemma 6.1. Since we have chosen the clusters and the nodes arbitrarily, this proves this direction.
Only if direction: Let node $w$ be part of the non-empty intersection between two soft clusters $C_x$ and $C_y$ defined by $s^*$ in the expanded graph $G'$. By Lemma 6.1, $w$ belongs to $C_x$ if and only if there exists a node $w_1$ such that $(w, w_1) \in E'$ and $s'_x(w_1) > 0$. Similarly, there must exist a node $w_2$ such that $(w, w_2) \in E'$ and $s'_y(w_2) > 0$. By the construction of $G'$, either both $w_1$ and $w_2$ are part of the clique $w$ is part of, or one of them is in the clique and the other is outside the clique.
Suppose without loss of generality that $w_1$ is in the clique and $w_2$ is outside the clique. Then $s'_x(w) = s'_x(w_1) > 0$, since they share the same cluster assignment. By the construction of $G'$, the edge between the nodes of the original graph corresponding to $w$ and $w_2$ is an edge between clusters $x$ and $y$. Similarly, if both $w_1$ and $w_2$ are part of the same clique as $w$, then they all originate from the same node $v$ in the original graph. The self-loop of $v$ is then an edge between clusters $x$ and $y$. ∎
Proposition 6.1.
Let $G$ be a graph with self-loops and adjacency matrix $A$, and let $S$ be a soft cluster assignment over $G$. The graph defined by the coarsened adjacency matrix $S^{T} A S$ and the Mapper graph obtained from the expanded graph $G_e$ with the pull-back cover induced by the 1-hop expansion $S' = A_e S_e$ are isomorphic.
Proof.
By the definition of matrix multiplication, $(S^{T} A S)_{xy} > 0$ if and only if the soft clusters $x$ and $y$ defined by $S$ have a common edge between them in $G$. Based on Lemma 6.2, this is the case if and only if the corresponding pull-back clusters intersect in $G_e$, which is exactly when Mapper connects them. ∎
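The connectivity claim underlying the proposition can also be checked numerically. The sketch below (with a hypothetical toy graph and assignment, not taken from the paper) confirms that the sparsity pattern of $S^{T} A S$ matches the existence of common edges between soft clusters:

```python
import numpy as np

# Hypothetical toy graph with self-loops and a soft assignment over
# K = 2 clusters; both are illustrative choices only.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
], dtype=float)
S = np.array([
    [1.0, 0.0],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.0, 1.0],
])

pooled = S.T @ A @ S  # coarsened adjacency used by DiffPool / MinCut pooling

K = S.shape[1]
n = A.shape[0]
for x in range(K):
    for y in range(K):
        # An edge (i, j) is "common" to clusters x and y when i has
        # positive membership in x and j has positive membership in y.
        common_edge = any(
            A[i, j] > 0 and S[i, x] > 0 and S[j, y] > 0
            for i in range(n)
            for j in range(n)
        )
        assert (pooled[x, y] > 0) == common_edge
```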
Appendix B: Model Architecture and Hyperparameters
The optimised models described in Section 5.3 have the following configurations:

- DGM: learning rate , hidden sizes , and:
  - D&D and Collab: cover sizes , interval overlap , batch size ;
  - Proteins: cover sizes , interval overlap , batch size ;
  - Reddit-Binary: cover sizes , interval overlap , batch size ;
- Top-k: dataset-specific configurations, as provided in the official GitHub repository (run_GUNet.sh);
- minCUT: learning rate , same architecture as reported by the authors in the original work (Bianchi et al., 2019);
- DiffPool: learning rate , hidden size , two pooling steps, pooling ratio , global mean readout layer, with the exception of Collab and Reddit-Binary, where the hidden size was .

We additionally performed a hyperparameter search for DiffPool over hidden sizes and, for DGM, over the following sets of possible values:

- all datasets: cover sizes , interval overlap ;
- D&D: learning rate ;
- Proteins: learning rate , cover sizes , hidden sizes .
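A sweep of this kind amounts to an exhaustive grid search over the Cartesian product of the value sets. The skeleton below is a generic sketch: the grid values and the `validation_accuracy` stub are hypothetical placeholders, not the values or the training loop used in the paper.

```python
from itertools import product

# Hypothetical search grid; the concrete values are those reported above.
grid = {
    "cover_size": [5, 10, 20],             # placeholder values
    "interval_overlap": [0.1, 0.25, 0.5],  # placeholder values
    "learning_rate": [1e-3, 5e-4],         # placeholder values
}

def validation_accuracy(config):
    # Stand-in for training and evaluating one DGM configuration;
    # replace with a real train/validate loop. Deterministic dummy score.
    return 1.0 / (1.0 + config["cover_size"] * config["interval_overlap"])

best_config, best_acc = None, -1.0
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    acc = validation_accuracy(config)
    if acc > best_acc:
        best_config, best_acc = config, acc

print(best_config)
```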
Footnotes
 Note that one or more {embedding, pooling} operations may be sequentially performed in the pipeline.
 github.com/HongyangGao/GraphUNets
References
 The PageRank Axioms. In Dagstuhl Seminar Proceedings.
 NCVis: Noise Contrastive Approach for Scalable Visualization. arXiv preprint arXiv:2001.11411.
 Battaglia et al. (2018). Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
 Bianchi et al. (2019). MinCut Pooling in Graph Neural Networks. arXiv preprint arXiv:1907.00481.
 PageRank as a Function of the Damping Factor. In Proceedings of the 14th International Conference on World Wide Web, pp. 557–566.
 Bronstein et al. (2017). Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
 Towards Sparse Hierarchical Graph Classifiers. arXiv preprint arXiv:1811.01287.
 PersLay: A Neural Network Layer for Persistence Diagrams and New Graph Topological Signatures. stat 1050, pp. 17.
 Statistical Analysis and Parameter Selection for Mapper. The Journal of Machine Learning Research 19 (1), pp. 478–516.
 Chazal and Michel (2017). An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists. arXiv preprint arXiv:1710.04019.
 UC Berkeley CS267, Lecture 20: Partitioning Graphs without Coordinate Information II.
 Motif Simplification: Improving Network Visualization Readability with Fan, Connector, and Clique Glyphs. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13), New York, NY, USA, pp. 3247–3256.
 Edge Compression Techniques for Visualization of Dense Directed Graphs. IEEE Transactions on Visualization and Computer Graphics 19 (12), pp. 2596–2605.
 Persistent Homology: a Survey. Discrete & Computational Geometry.
 Algebraic Connectivity of Graphs. Czechoslovak Mathematical Journal 23 (2), pp. 298–305.
 Topology of Learning in Artificial Neural Networks.
 A look at the topology of convolutional neural networks. CoRR abs/1810.03234.
 An open graph visualization system and its applications to software engineering. Software: Practice and Experience 30 (11), pp. 1203–1233.
 Graph U-Nets. In International Conference on Machine Learning, pp. 2083–2092.
 Exploring Network Structure, Dynamics, and Function using NetworkX. Technical report, Los Alamos National Laboratory, Los Alamos, NM, United States.
 Mapper on Graphs for Network Visualization. arXiv preprint arXiv:1804.11242.
 MOG: Mapper on Graphs for Relationship Preserving Clustering. arXiv preprint arXiv:1804.11242.
 Deep Learning with Topological Signatures. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17), Red Hook, NY, USA, pp. 1633–1643.
 Graph Filtration Learning. CoRR abs/1905.10996.
 AttPool: Towards Hierarchical Feature Representation in Graph Convolutional Networks via Attention Mechanism. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6480–6489.
 Principal Component Analysis. Springer-Verlag, New York.
 Benchmark Data Sets for Graph Kernels.
 Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
 Semi-Supervised Classification with Graph Convolutional Networks. arXiv preprint arXiv:1609.02907.
 Self-Attention Graph Pooling. In International Conference on Machine Learning, pp. 3734–3743.
 CS224W: Social and Information Network Analysis, Graph Clustering.
 Clique pooling for graph classification. arXiv preprint arXiv:1904.00374.
 Graph Convolutional Networks with EigenPooling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 723–731.
 The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.
 ASAP: Adaptive Structure Aware Pooling for Learning Hierarchical Graph Representations. arXiv preprint arXiv:1911.07979.
 Scarselli et al. (2008). The Graph Neural Network Model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
 Collective Classification in Network Data. AI Magazine 29 (3), pp. 93–93.
 Singh et al. (2007). Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. The Eurographics Association.
 NodeXL: a free and open network overview, discovery and exploration add-in for Excel 2007/2010.
 Visualizing Data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
 Deep Graph Infomax. arXiv preprint arXiv:1809.10341.
 GNNExplainer: Generating Explanations for Graph Neural Networks. In Advances in Neural Information Processing Systems, pp. 9240–9251.
 Hierarchical Graph Representation Learning with Differentiable Pooling. In Advances in Neural Information Processing Systems, pp. 4800–4810.