Graphons, mergeons, and so on!
Abstract
In this work we develop a theory of hierarchical clustering for graphs. Our modeling assumption is that graphs are sampled from a graphon, which is a powerful and general model for generating graphs and analyzing large networks. Graphons are a far richer class of graph models than stochastic blockmodels, the primary setting for recent progress in the statistical theory of graph clustering. We define what it means for an algorithm to produce the “correct” clustering, give sufficient conditions under which a method is statistically consistent, and provide an explicit algorithm satisfying these properties.
Justin Eldridge, Mikhail Belkin, Yusu Wang. The Ohio State University. {eldridge, mbelkin, yusu}@cse.ohio-state.edu
29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
1 Introduction
A fundamental problem in the theory of clustering is that of defining a cluster. There is no single answer to this seemingly simple question. The right approach depends on the nature of the data and the proper modeling assumptions. In a statistical setting where the objects to be clustered come from some underlying probability distribution, it is natural to define clusters in terms of the distribution itself. The task of a clustering, then, is twofold – to identify the appropriate cluster structure of the distribution and to recover that structure from a finite sample. Thus we would like to say that a clustering is good if it is in some sense close to the ideal structure of the underlying distribution, and that a clustering method is consistent if it produces clusterings which converge to the true clustering, given larger and larger samples. Proving the consistency of a clustering method deepens our understanding of it, and provides justification for using the method in the appropriate setting.
In this work, we consider the setting in which the objects to be clustered are the vertices of a graph sampled from a graphon – a very general random graph model of significant recent interest. We develop a statistical theory of graph clustering in the graphon model. To the best of our knowledge, this is the first general consistency framework developed for such a rich family of random graphs. The specific contributions of this paper are threefold. First, we define the clusters of a graphon. Our definition results in a graphon having a tree of clusters, which we call its graphon cluster tree. We introduce an object called the mergeon, which is a particular representation of the graphon cluster tree that encodes the heights at which clusters merge. Second, we develop a notion of consistency for graph clustering algorithms in which a method is said to be consistent if its output converges to the graphon cluster tree. Here the graphon setting poses subtle yet fundamental challenges which differentiate it from classical clustering models, and which must be carefully addressed. Third, we prove the existence of consistent clustering algorithms. In particular, we provide sufficient conditions under which a graphon estimator leads to a consistent clustering method. We then identify a specific practical algorithm which satisfies these conditions, and in doing so present a simple graph clustering algorithm which provably recovers the graphon cluster tree.
Related work. Graphons are objects of significant recent interest in graph theory, statistics, and machine learning. The theory of graphons is rich and diverse; a graphon can be interpreted as a generalization of a weighted graph with uncountably many nodes, as the limit of a sequence of finite graphs, or, more importantly for the present work, as a very general model for generating unweighted, undirected graphs. Conveniently, any graphon can be represented as a symmetric, measurable function $W : [0,1]^2 \to [0,1]$, and it is this representation that we use throughout this paper.
The graphon as a graph limit was introduced in recent years by Lovász and Szegedy [2006], Borgs et al. [2008], and others. The interested reader is directed to the book by Lovász [2012] on the subject. There has also been a considerable recent effort to produce consistent estimators of the graphon, including the work of Wolfe and Olhede [2013], Chan and Airoldi [2014], Airoldi et al. [2013], Rohe et al. [2011], and others. We will analyze a simple modification of the graphon estimator proposed by Zhang et al. [2015] and show that it leads to a graph clustering algorithm which is a consistent estimator of the graphon cluster tree.
Much of the previous statistical theory of graph clustering methods assumes that graphs are generated by the so-called stochastic blockmodel. The simplest form of the model generates a graph with $n$ nodes by assigning each node, randomly or deterministically, to one of two communities. An edge between two nodes is added with probability $\alpha$ if they are from the same community and with probability $\beta$ otherwise. A graph clustering method is said to achieve exact recovery if it identifies the true community assignment of every node in the graph with high probability as $n \to \infty$. The blockmodel is a special case of a graphon model, and our notion of consistency will imply exact recovery of communities.
Stochastic blockmodels are widely studied, and it is known that, for example, spectral methods like that of McSherry [2001] are able to recover the communities exactly as $n \to \infty$, provided that $\alpha$ and $\beta$ remain constant, or that the gap between them does not shrink too quickly. For a summary of consistency results in the blockmodel, see Abbe et al. [2015], which also provides information-theoretic thresholds for the conditions under which exact recovery is possible. In a related direction, Balakrishnan et al. [2011] examines the ability of spectral clustering to withstand noise in a hierarchical block model.
The density setting. The problem of defining the underlying cluster structure of a probability distribution goes back to Hartigan [1981], who considered the setting in which the objects to be clustered are points sampled from a density $f$. In this case, the high density clusters of $f$ are defined to be the connected components of the upper level sets $\{x : f(x) \geq \lambda\}$ for any $\lambda > 0$. The set of all such clusters forms the so-called density cluster tree. Hartigan [1981] defined a notion of consistency for the density cluster tree, and proved that single-linkage clustering is not consistent. In recent years, Chaudhuri and Dasgupta [2010] and Kpotufe and Luxburg [2011] have demonstrated methods which are Hartigan consistent. Eldridge et al. [2015] introduced a distance between a clustering of the data and the density cluster tree, called the merge distortion metric. A clustering method is said to be consistent if the trees it produces converge in merge distortion to the density cluster tree. It is shown that convergence in merge distortion is stronger than Hartigan consistency, and that the method of Chaudhuri and Dasgupta [2010] is consistent in this stronger sense.
In the present work, we will be motivated by the approach taken in Hartigan [1981] and Eldridge et al. [2015]. We note, however, that there are significant and fundamental differences between the density case and the graphon setting. Specifically, it is possible for two graphons to be equivalent in the same way that two graphs are: up to a relabeling of the vertices. As such, a graphon $W$ is a representative of an equivalence class of graphons modulo appropriately defined relabeling. It is therefore necessary to define the clusters of $W$ in a way that does not depend upon the particular representative used. A similar problem occurs in the density setting when we wish to define the clusters not of a single density function, but rather of a class of densities which are equal almost everywhere; Steinwart [2011] provides an elegant solution. But while the domain of a density is equipped with a meaningful metric – the mass of a ball around a point is the same under two equivalent densities – the ambient metric on the vertices of a graphon is not useful. As a result, approaches such as that of Steinwart [2011] do not directly apply to the graphon case, and we must carefully produce our own. Additionally, we will see that the procedure for sampling a graph from a graphon involves latent variables which are in principle unrecoverable from data. These issues have no analogue in the classical density setting, and present very distinct challenges.
Miscellany. For simplicity, most of the (rather involved) technical details are in the appendix. We will use $[n]$ to denote the set $\{1, \ldots, n\}$, $\triangle$ for the symmetric difference, $\mu$ for the Lebesgue measure on $[0,1]$, and bold letters to denote random variables.
2 The graphon model
In order to discuss the statistical properties of a graph clustering algorithm, we must first model the process by which graphs are generated. Formally, a random graph model is a sequence of random variables $\mathbf{G}_1, \mathbf{G}_2, \ldots$ such that the range of $\mathbf{G}_n$ consists of undirected, unweighted graphs with node set $[n]$, and the distribution of $\mathbf{G}_n$ is invariant under relabeling of the nodes – that is, isomorphic graphs occur with equal probability. A random graph model of considerable recent interest is the graphon model, in which the distribution over graphs is determined by a symmetric, measurable function $W : [0,1]^2 \to [0,1]$ called a graphon. Informally, a graphon may be thought of as the weight matrix of an infinite graph whose node set is the continuous unit interval, so that $W(x,y)$ represents the weight of the edge between nodes $x$ and $y$.
Interpreting $W(x,y)$ as a probability suggests the following graph sampling procedure: To draw a graph with $n$ nodes, we first select $n$ points $x_1, \ldots, x_n$ at random from the uniform distribution on $[0,1]$ – we can think of these $x_i$ as being $n$ random “nodes” in the graphon. We then sample a random graph $\mathbf{G}$ on node set $[n]$ by admitting the edge $(i,j)$ with probability $W(x_i, x_j)$; by convention, self-edges are not sampled. It is important to note that while we begin by drawing a set of nodes $\{x_i\}$ from the graphon, the graph as given to us is labeled by integers. Therefore, the correspondence between node $i$ in the graph and node $x_i$ in the graphon is latent.
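The sampling procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `sample_graph` and the two-community graphon `two_block` (with illustrative values 0.8 and 0.1) are assumptions introduced here.

```python
import random

def sample_graph(W, n, rng=None):
    """Sample an n-node graph from graphon W; returns (adjacency, latent points)."""
    rng = rng or random.Random()
    # Latent "graphon nodes": n i.i.d. uniform points on the unit interval.
    xs = [rng.random() for _ in range(n)]
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # j > i only, so self-edges are never sampled
            if rng.random() < W(xs[i], xs[j]):
                adj[i][j] = adj[j][i] = 1
    return adj, xs

def two_block(x, y, alpha=0.8, beta=0.1):
    """Hypothetical two-community blockmodel graphon (illustrative values)."""
    same_side = (x < 0.5) == (y < 0.5)
    return alpha if same_side else beta

adj, xs = sample_graph(two_block, 20, random.Random(0))
```

A clustering algorithm would be handed only `adj`; the points `xs` are returned here purely to make the latent correspondence visible.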
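It can be shown that this sampling procedure defines a distribution on finite graphs, such that the probability of a graph $G = ([n], E)$ is given by
$$\mathbb{P}_W(\mathbf{G} = G) = \int_{[0,1]^n} \prod_{(i,j) \in E} W(x_i, x_j) \prod_{(i,j) \notin E} \left[ 1 - W(x_i, x_j) \right] \, dx_1 \cdots dx_n. \qquad (1)$$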
For a fixed choice of $x_1, \ldots, x_n \in [0,1]$, the integrand represents the likelihood that the graph $G$ is sampled when the probability of the edge $(i,j)$ is assumed to be $W(x_i, x_j)$. By integrating over all possible choices of $x_1, \ldots, x_n$, we obtain the probability of the graph.
A very general class of random graph models may be represented as graphons. In particular, a random graph model $\mathbf{G}_1, \mathbf{G}_2, \ldots$ is said to be consistent if the random graph $\mathbf{F}_{k-1}$ obtained by deleting node $k$ from $\mathbf{G}_k$ has the same distribution as $\mathbf{G}_{k-1}$. A random graph model is said to be local if whenever $S, T \subset [k]$ are disjoint, the random subgraphs of $\mathbf{G}_k$ induced by $S$ and $T$ are independent random variables. A result of Lovász and Szegedy [2006] is that any consistent, local random graph model is equivalent to the distribution on graphs defined by $\mathbb{P}_W$ (Equation 1) for some graphon $W$; the converse is true as well. That is, any such random graph model is equivalent to a graphon.
A particular random graph model is not uniquely defined by a graphon – it is clear from Equation 1 that two graphons $W_1$ and $W_2$ which are equal almost everywhere (i.e., differ on a set of measure zero) define the same distribution on graphs. In fact, the distribution defined by $W$ is unchanged by “relabelings” of $W$’s nodes. More formally, if $\Sigma$ is the sigma-algebra of Lebesgue measurable subsets of $[0,1]$ and $\mu$ is the Lebesgue measure, we say that a relabeling function $\varphi : ([0,1], \Sigma) \to ([0,1], \Sigma)$ is measure preserving if for any measurable set $A \in \Sigma$, $\mu(\varphi^{-1}(A)) = \mu(A)$. We define the relabeled graphon $W^\varphi$ by $W^\varphi(x,y) = W(\varphi(x), \varphi(y))$. By analogy with finite graphs, we say that graphons $W_1$ and $W_2$ are weakly isomorphic if they are equivalent up to relabeling, i.e., if there exist measure preserving maps $\varphi_1$ and $\varphi_2$ such that $W_1^{\varphi_1} = W_2^{\varphi_2}$ almost everywhere. Weak isomorphism is an equivalence relation, and most of the important properties of a graphon in fact belong to its equivalence class. For instance, a powerful result of Lovász [2012] is that two graphons define the same random graph model if and only if they are weakly isomorphic.
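To make relabeling concrete, the sketch below builds $W^\varphi$ from $W$ and numerically checks the measure preserving property for one map. The doubling map $x \mapsto 2x \bmod 1$ is chosen here only as a standard example of a measure preserving transformation; it is an assumption of this sketch, as are the helper names and the grid-based approximation of $\mu$.

```python
def relabel(W, phi):
    """Return the relabeled graphon W^phi(x, y) = W(phi(x), phi(y))."""
    return lambda x, y: W(phi(x), phi(y))

def doubling(x):
    """Doubling map on the unit interval, a standard measure preserving map."""
    return (2.0 * x) % 1.0

def preimage_measure(phi, a, b, grid=100_000):
    """Approximate mu(phi^{-1}([a, b))) by counting midpoints of a uniform grid."""
    hits = sum(1 for k in range(grid) if a <= phi((k + 0.5) / grid) < b)
    return hits / grid

W = lambda x, y: x * y            # toy graphon, illustrative only
W_phi = relabel(W, doubling)
approx = preimage_measure(doubling, 0.2, 0.5)   # should be close to 0.5 - 0.2
```

The preimage of $[0.2, 0.5)$ under the doubling map is $[0.1, 0.25) \cup [0.6, 0.75)$, which has the same measure as the interval itself, so `approx` should be close to 0.3.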
An example of a graphon $W$ is shown in Figure 1(a). It is conventional to plot the graphon as one typically plots an adjacency matrix: with the origin in the upper-left corner. Darker shades correspond to higher values of $W$. Figure 1(b) depicts a graphon $W^\varphi$ which is weakly isomorphic to $W$. In particular, $W^\varphi$ is the relabeling of $W$ by a measure preserving transformation $\varphi$. As such, the graphons shown in Figures 1(a) and 1(b) define the same distribution on graphs. Figure 1(c) shows the adjacency matrix $A$ of a graph sampled from the distribution defined by the equivalence class containing $W$ and $W^\varphi$. Note that it is in principle not possible to determine from $A$ alone which graphon, $W$ or $W^\varphi$, it was sampled from, or to which node in $W$ a particular column of $A$ corresponds.
3 The graphon cluster tree
We now identify the cluster structure of a graphon. We will define a graphon’s clusters such that they are analogous to the maximally connected components of a finite graph. It turns out that the collection of all clusters has hierarchical structure; we call this object the graphon cluster tree. We propose that the goal of clustering in the graphon setting is the recovery of the graphon cluster tree.
Connectedness and clusters. Consider a finite weighted graph. It is natural to cluster the graph into connected components. In fact, because of the weighted edges, we can speak of the clusters of the graph at various levels. More precisely, we say that a set of nodes $C$ is internally connected – or, from now on, just connected – at level $\lambda$ if for every pair of nodes in $C$ there is a path between them such that every node along the path is also in $C$, and the weight of every edge in the path is at least $\lambda$. Equivalently, $C$ is connected at level $\lambda$ if and only if for every partitioning of $C$ into disjoint, non-empty sets $C_1$ and $C_2$ there is an edge of weight $\lambda$ or greater between $C_1$ and $C_2$. The clusters at level $\lambda$ are then the largest connected components at level $\lambda$.
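For the finite case, the clusters at a given level are simply the connected components of the graph after discarding edges lighter than that level. A minimal sketch using a union-find structure follows; the function name and the four-node weight matrix are illustrative assumptions.

```python
def clusters_at_level(weights, lam):
    """Connected components of the graph keeping only edges of weight >= lam."""
    n = len(weights)
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for i in range(n):
        for j in range(i + 1, n):
            if weights[i][j] >= lam:
                parent[find(i)] = find(j)

    groups = {}
    for u in range(n):
        groups.setdefault(find(u), []).append(u)
    return sorted(groups.values())

# Illustrative weights: nodes 0 and 1 are tightly linked, 2 joins them
# at a lower level, and 3 joins only at the lowest level.
weights = [
    [0.0, 0.9, 0.5, 0.1],
    [0.9, 0.0, 0.5, 0.1],
    [0.5, 0.5, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
```

Lowering the level from 0.9 to 0.5 to 0.1 merges the components step by step, which is the hierarchical structure the graphon definition is designed to mirror.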
A graphon is, in a sense, an infinite weighted graph, and we will define the clusters of a graphon using the example above as motivation. In doing so, we must be careful to make our notion robust to changes of the graphon on a set of zero measure, as such changes do not affect the graph distribution defined by the graphon. We base our definition on that of Janson [2008], who defined what it means for a graphon to be connected as a whole. We extend the definition in Janson [2008] to speak of the connectivity of subsets of the graphon’s nodes at a particular height. Our definition is directly analogous to the notion of internal connectedness in finite graphs.
Definition 1 (Connectedness).
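Let $W$ be a graphon, and let $A \subset [0,1]$ be a set of positive measure. We say that $A$ is disconnected at level $\lambda$ if there exists a measurable $S \subset A$ such that $0 < \mu(S) < \mu(A)$, and $W < \lambda$ almost everywhere on $S \times (A \setminus S)$. Otherwise, we say that $A$ is connected at level $\lambda$.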
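We now identify the clusters of a graphon; as in the finite case, we will frame our definition in terms of maximally connected components. We begin by gathering all subsets of $[0,1]$ which should belong to some cluster at level $\lambda$. Naturally, if a set is connected at level $\lambda$, it should be in a cluster at level $\lambda$; for technical reasons, we will also say that a set which is connected at all levels $\lambda' < \lambda$ (though perhaps not at $\lambda$) should be contained in a cluster at level $\lambda$, as well. That is, for any $\lambda$, the collection $\mathcal{A}_\lambda$ of sets which should be contained in some cluster at level $\lambda$ is
$$\mathcal{A}_\lambda = \{ A \in \Sigma : \mu(A) > 0 \text{ and } A \text{ is connected at every level } \lambda' < \lambda \}.$$
Now suppose $A_1, A_2 \in \mathcal{A}_\lambda$, and that there is a set $A \in \mathcal{A}_\lambda$ such that $A \supset A_1 \cup A_2$. Naturally, the cluster to which $A$ belongs should also contain $A_1$ and $A_2$, since both are subsets of $A$. We will therefore consider $A_1$ and $A_2$ to be equivalent, in the sense that they should be contained in the same cluster at level $\lambda$. More formally, we define a relation $\sim_\lambda$ on $\mathcal{A}_\lambda$ by
$$A_1 \sim_\lambda A_2 \iff \exists\, A \in \mathcal{A}_\lambda \text{ such that } A \supset A_1 \cup A_2.$$
It can be verified that $\sim_\lambda$ is an equivalence relation on $\mathcal{A}_\lambda$; see Claim 4 in Appendix B.
Each equivalence class in the quotient space $\mathcal{A}_\lambda / \sim_\lambda$ consists of connected sets which should intuitively be clustered together at level $\lambda$. Naturally, we will define the clusters to be the largest elements of each class; in some sense, these are the maximally connected components at level $\lambda$. More precisely, suppose $\mathscr{A}$ is such an equivalence class. It is clear that in general no single member $A \in \mathscr{A}$ can contain all other members of $\mathscr{A}$, since adding a null set (i.e., a set of measure zero) to $A$ results in a larger set which is nevertheless still a member of $\mathscr{A}$. However, we can find a member $A^* \in \mathscr{A}$ which contains all but a null set of every other set in $\mathscr{A}$. More formally, we say that $A^*$ is an essential maximum of the class $\mathscr{A}$ if $A^* \in \mathscr{A}$ and for every $A \in \mathscr{A}$, $\mu(A \setminus A^*) = 0$. $A^*$ is of course not unique, but it is unique up to a null set; i.e., for any two essential maxima $A_1, A_2$ of $\mathscr{A}$, we have $\mu(A_1 \triangle A_2) = 0$. We will write the set of essential maxima of $\mathscr{A}$ as $\operatorname{ess\,max} \mathscr{A}$; the fact that the essential maxima are well-defined is proven in Claim 5 in Appendix B. We then define clusters as the maximal members of each equivalence class in $\mathcal{A}_\lambda / \sim_\lambda$:
Definition 2 (Clusters).
The set of clusters at level $\lambda$ in $W$, written $\mathbb{C}_W(\lambda)$, is defined to be the countable collection
$$\mathbb{C}_W(\lambda) = \{ \operatorname{ess\,max} \mathscr{A} : \mathscr{A} \in \mathcal{A}_\lambda / \sim_\lambda \}.$$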
Note that a cluster $C$ of a graphon is not a subset of the unit interval per se, but rather an equivalence class of subsets which differ only by null sets. It is often possible to treat clusters as sets rather than equivalence classes, and we may write $\mu(C)$, $C \cup C'$, etc., without ambiguity. In addition, if $\varphi$ is a measure preserving transformation, then $\varphi^{-1}(C)$ is well-defined.
For a concrete example of our notion of a cluster, consider the graphon $W$ depicted in Figure 1(a). $A$, $B$, and $C$ represent sets of the graphon’s nodes. By our definitions there are three clusters at level $\lambda_3$: $A$, $B$, and $C$. Clusters $A$ and $B$ merge into a cluster $A \cup B$ at level $\lambda_2$, while $C$ remains a separate cluster. Everything is joined into a cluster $A \cup B \cup C$ at level $\lambda_1$.
We have taken care to define the clusters of a graphon in such a way as to be robust to changes of measure zero to the graphon itself. In fact, clusters are also robust to measure preserving transformations. The proof of this result is nontrivial, and comprises Appendix C.
Claim. Let $W$ be a graphon and $\varphi$ a measure preserving transformation. Then $C$ is a cluster of $W^\varphi$ at level $\lambda$ if and only if there exists a cluster $C'$ of $W$ at level $\lambda$ such that $C = \varphi^{-1}(C')$.
Cluster trees and mergeons. The set of all clusters of a graphon at any level has hierarchical structure in the sense that, given any pair of distinct clusters $C$ and $C'$, either one is “essentially” contained within the other, i.e., $C \subset C'$ or $C' \subset C$, or they are “essentially” disjoint, i.e., $\mu(C \cap C') = 0$, as is proven by Claim 3 in Appendix B. Because of this hierarchical structure, we call the set $\mathscr{C}_W$ of all clusters from any level of the graphon $W$ the graphon cluster tree of $W$. It is this tree that we hope to recover by applying a graph clustering algorithm to a graph sampled from $W$.
We may naturally speak of the height at which pairs of distinct clusters merge in the cluster tree. For instance, let $C_1$ and $C_2$ be distinct clusters of $\mathscr{C}_W$. We say that the merge height of $C_1$ and $C_2$ is the level at which they are joined into a single cluster, i.e.,
$$\max \{ \lambda : C_1 \cup C_2 \subset C \text{ for some } C \in \mathbb{C}_W(\lambda) \},$$
with containment understood up to null sets. However, while the merge height of clusters is well-defined, the merge height of individual points is not. This is because the cluster tree is not a collection of sets, but rather a collection of equivalence classes of sets, and so a point does not belong to any one cluster more than any other. Note that this is distinct from the classical density case considered in Hartigan [1981], Chaudhuri and Dasgupta [2010], and Eldridge et al. [2015], where the merge height of any pair of points is well-defined.
Nevertheless, consider a measurable function $M : [0,1]^2 \to [0,1]$ which assigns a merge height to every pair of points. While the value of $M$ on any given pair is arbitrary, the value of $M$ on sets of positive measure is constrained. Intuitively, if $C$ is a cluster at level $\lambda$, then we must have $M \geq \lambda$ almost everywhere on $C \times C$. If $M$ satisfies this constraint for every cluster $C$ we call $M$ a mergeon for $\mathscr{C}$, as it is a graphon which determines a particular choice for the merge heights of every pair of points in $[0,1]$. More formally:
Definition 3 (Mergeon).
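Let $\mathscr{C}$ be a cluster tree. A mergeon of $\mathscr{C}$ is a graphon $M$ such that for all $\lambda \in [0,1]$, $M^{-1}[\lambda, 1] = \bigcup_{C \in \mathscr{C}(\lambda)} C \times C$, where $M^{-1}[\lambda, 1] = \{ (x,y) \in [0,1]^2 : M(x,y) \geq \lambda \}$ and $\mathscr{C}(\lambda)$ denotes the clusters of $\mathscr{C}$ at level $\lambda$. (Footnote 1: The definition given here involves a slight abuse of notation. For a precise – but more technical – version, see Section A.2.)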
An example of a mergeon and the cluster tree it represents is shown in Figure 2. In fact, the cluster tree depicted is that of the graphon $W$ from Figure 1(a). The mergeon encodes the height at which clusters $A$, $B$, and $C$ merge. In particular, the fact that $M = \lambda_2$ everywhere on $A \times B$ represents the merging of $A$ and $B$ at level $\lambda_2$ in $W$.
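A mergeon for a simple three-block cluster tree can be written down directly as a step function. The sketch below is illustrative only: the block boundaries at 1/3 and 2/3, the merge levels 0.2, 0.5, and 0.8, and the helper names are all assumptions made here, not values taken from the figures.

```python
def make_mergeon(l1=0.2, l2=0.5, l3=0.8):
    """Mergeon of a hypothetical tree on blocks A=[0,1/3), B=[1/3,2/3), C=[2/3,1]."""
    def block(x):
        return min(int(3 * x), 2)      # 0 -> A, 1 -> B, 2 -> C

    def M(x, y):
        bx, by = block(x), block(y)
        if bx == by:
            return l3                  # same block: clustered together up to l3
        if {bx, by} == {0, 1}:
            return l2                  # A and B merge at level l2
        return l1                      # C joins the union of A and B at level l1
    return M

def together_at(M, x, y, lam):
    """True if x and y share a cluster at level lam, i.e. M(x, y) >= lam."""
    return M(x, y) >= lam

M = make_mergeon()
```

Reading the upper level set of `M` at a level recovers the clusters at that level: for instance, a point of A and a point of B are together at level 0.5 but separated at level 0.6.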
It is clear that in general there is no unique mergeon representing a graphon cluster tree. However, the above definition implies that two mergeons representing the same cluster tree are equal almost everywhere. Additionally, we have the following two claims, whose proofs are in Appendix B.
Claim. Let $\mathscr{C}$ be a cluster tree, and suppose $M$ is a mergeon representing $\mathscr{C}$. Then $C \in \mathbb{C}_M(\lambda)$ if and only if $C$ is a cluster in $\mathscr{C}$ at level $\lambda$. In other words, the cluster tree of $M$ is also $\mathscr{C}$.
Claim. Let $W$ be a graphon and $M$ a mergeon of the cluster tree of $W$. If $\varphi$ is a measure preserving transformation, then $M^\varphi$ is a mergeon of the cluster tree of $W^\varphi$.
4 Notions of consistency
We have so far defined the sense in which a graphon has hierarchical cluster structure. We now turn to the problem of determining whether a clustering algorithm is able to recover this structure when applied to a graph sampled from a graphon. Our approach is to define a distance between the infinite graphon cluster tree and a finite clustering. We will then define consistency by requiring that a consistent method converge to the graphon cluster tree in this distance for all inputs minus a set of vanishing probability.
Merge distortion. A hierarchical clustering $\mathcal{C}$ of a set $S$ – or, from now on, just a clustering of $S$ – is a hierarchical collection of subsets of $S$ such that $S \in \mathcal{C}$ and for all $C, C' \in \mathcal{C}$, either $C \subset C'$, $C' \subset C$, or $C \cap C' = \emptyset$. Suppose $\mathcal{C}$ is a clustering of a finite set $S = \{x_1, \ldots, x_n\}$ consisting of graphon nodes; i.e., $x_i \in [0,1]$. How might we measure the distance between this clustering and a graphon cluster tree $\mathscr{C}$? Intuitively, the two trees are close if every pair of points in $S$ merges in $\mathcal{C}$ at about the same level as they merge in $\mathscr{C}$. But this informal description faces two problems: First, $\mathscr{C}$ is a collection of equivalence classes of sets, and so the height at which any pair of points merges in $\mathscr{C}$ is not defined. Recall, however, that the cluster tree has an alternative representation as a mergeon. A mergeon does define a merge height for every pair of nodes in a graphon, and thus provides a solution to this first issue. Second, the clustering $\mathcal{C}$ is not equipped with a height function, and so the height at which any pair of points merges in $\mathcal{C}$ is also undefined. Following Eldridge et al. [2015], our approach is to induce a merge height function on the clustering using the mergeon in the following way:
Definition 4 (Induced merge height).
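Let $M$ be a mergeon, and suppose $S$ is a finite subset of $[0,1]$. Let $\mathcal{C}$ be a clustering of $S$. The merge height function on $\mathcal{C}$ induced by $M$ is defined by $\hat{M}_{\mathcal{C}}(s, s') = \min_{u, v \in \mathcal{C}(s, s')} M(u, v)$ for every $(s, s') \in S \times S$, where $\mathcal{C}(s, s')$ denotes the smallest cluster $C \in \mathcal{C}$ which contains both $s$ and $s'$.
We measure the distance between a clustering and the cluster tree using the merge distortion:
Definition 5.
Let $M$ be a mergeon, $S$ a finite subset of $[0,1]$, and $\mathcal{C}$ a clustering of $S$. The merge distortion is defined by $d_S(M, \hat{M}_{\mathcal{C}}) = \max_{s, s' \in S,\, s \neq s'} | M(s, s') - \hat{M}_{\mathcal{C}}(s, s') |$.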
Defining the induced merge height and merge distortion in this way leads to an especially meaningful interpretation of the merge distortion. In particular, if the merge distortion between $\mathcal{C}$ and $M$ is $\epsilon$, then any two clusters of $M$ which are separated at level $\lambda$ but merge below level $\lambda - \epsilon$ are correctly separated in the clustering $\mathcal{C}$. A similar result guarantees that a cluster in $M$ is connected in $\mathcal{C}$ at within $\epsilon$ of the correct level. For a precise statement of these results, see Section A.4.
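The induced merge height and the merge distortion are straightforward to compute for a finite clustering. The following is a minimal sketch under stated assumptions: a clustering is represented as a list of frozensets that includes the full point set, the minimum is taken over pairs of distinct points so that the diagonal of the mergeon plays no role, and the three-block mergeon `M` with heights 0.8, 0.5, and 0.2 is a hypothetical example.

```python
from itertools import combinations

def smallest_cluster(clustering, s, t):
    """Smallest cluster of the hierarchy that contains both s and t."""
    return min((c for c in clustering if s in c and t in c), key=len)

def induced_merge_height(M, clustering, s, t):
    """Lowest mergeon value over pairs of distinct points in that cluster."""
    cluster = smallest_cluster(clustering, s, t)
    return min(M(u, v) for u, v in combinations(sorted(cluster), 2))

def merge_distortion(M, clustering, points):
    """Largest gap between mergeon and induced merge heights over all pairs."""
    return max(abs(M(s, t) - induced_merge_height(M, clustering, s, t))
               for s, t in combinations(points, 2))

def M(x, y):
    """Hypothetical three-block mergeon; block edges and heights are illustrative."""
    bx, by = min(int(3 * x), 2), min(int(3 * y), 2)
    if bx == by:
        return 0.8
    return 0.5 if {bx, by} == {0, 1} else 0.2

points = [0.1, 0.2, 0.4, 0.9]   # these fall in blocks 0, 0, 1, 2
good = [frozenset(points), frozenset([0.1, 0.2, 0.4]), frozenset([0.1, 0.2])]
bad = [frozenset(points), frozenset([0.1, 0.9])]
```

The clustering `good` mirrors the block hierarchy and has zero distortion, while `bad` groups a block-0 point with a block-2 point first and is penalized by the gap between the heights 0.8 and 0.2.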
The label measure. We will use the merge distortion to measure the distance between $\mathcal{C}$, a hierarchical clustering of a graph, and $\mathscr{C}_W$, the graphon cluster tree. Recall, however, that the nodes of a graph sampled from a graphon have integer labels. That is, $\mathcal{C}$ is a clustering of $[n]$, and not of a subset of $[0,1]$. Hence, in order to apply the merge distortion, we must first relabel the nodes of the graph, placing them in direct correspondence to nodes of the graphon, i.e., points in $[0,1]$.
Recall that we sample a graph of size $n$ from a graphon $W$ by first drawing $n$ points $x_1, \ldots, x_n$ uniformly at random from the unit interval. We then generate a graph on node set $[n]$ by connecting nodes $i$ and $j$ with probability $W(x_i, x_j)$. However, the nodes of the sampled graph are not labeled by $x_1, \ldots, x_n$, but rather by the integers $1, \ldots, n$. Thus we may think of $x_i$ as being the “true” latent label of node $i$. In general the latent node labeling is not recoverable from data, as is demonstrated by the figure to the right. We might suppose that the graph shown is sampled from the graphon above it, with each of nodes 1 through 4 corresponding to a particular point of the graphon. However, it is just as likely that node 4 corresponds to a different point, and so neither labeling is more “correct”. It is clear, though, that some labelings are less likely than others. For instance, the existence of an edge between two nodes of the graph makes it impossible that they correspond to a pair of graphon points at which $W$ is zero.
Therefore, given a graph $G$ sampled from a graphon, there are many possible relabelings of $G$ which place its nodes in correspondence with nodes of the graphon, but some are more likely than others. The merge distortion depends on which labeling of $G$ we assume, but, intuitively, a good clustering of $G$ will have small distortion with respect to highly probable labelings, and only have large distortion on improbable labelings. Our approach is to assign a probability to every pair $(G, S)$ of a graph and possible labeling. We will thus be able to measure the probability mass of the set of pairs for which a method performs poorly, i.e., results in a large merge distortion.
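More formally, let $\mathcal{G}_n$ denote the set of all undirected, unweighted graphs on node set $[n]$, and let $\Sigma^n$ be the sigma-algebra of Lebesgue-measurable subsets of $[0,1]^n$. A graphon $W$ induces a unique product measure $\Lambda_W$ defined on the product sigma-algebra $2^{\mathcal{G}_n} \times \Sigma^n$ such that for all $\mathcal{G} \in 2^{\mathcal{G}_n}$ and $\mathcal{S} \in \Sigma^n$:
$$\Lambda_W(\mathcal{G} \times \mathcal{S}) = \int_{\mathcal{S}} \Big( \sum_{G \in \mathcal{G}} L_W(S \mid G) \Big) \, dS, \qquad L_W(S \mid G) = \prod_{(i,j) \in E(G)} W(x_i, x_j) \prod_{(i,j) \notin E(G)} \left[ 1 - W(x_i, x_j) \right],$$
where $S = (x_1, \ldots, x_n)$ and $E(G)$ represents the edge set of the graph $G$. We recognize $L_W(S \mid G)$ as the integrand in Equation 1 for the probability of a graph as determined by a graphon. If $G$ is fixed, integrating $L_W(S \mid G)$ over all $S \in [0,1]^n$ gives the probability of $G$ under the model defined by $W$.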
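We may now formally define our notion of consistency. First, some notation: If $\mathcal{C}$ is a clustering of $[n]$ and $S = (x_1, \ldots, x_n)$, write $\mathcal{C} \circ S$ to denote the relabeling of $\mathcal{C}$ by $S$, in which $i$ is replaced by $x_i$ in every cluster. Then if $f$ is a hierarchical graph clustering method, $f(G) \circ S$ is a clustering of $S$, and $\hat{M}_{f(G) \circ S}$ denotes the merge function induced on $f(G) \circ S$ by $M$.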
Definition 6 (Consistency).
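Let $W$ be a graphon and $M$ be a mergeon of $W$. A hierarchical graph clustering method $f$ is said to be a consistent estimator of the graphon cluster tree of $W$ if for any fixed $\epsilon > 0$, as $n \to \infty$,
$$\Lambda_W \left( \left\{ (G, S) : d_S(M, \hat{M}_{f(G) \circ S}) > \epsilon \right\} \right) \to 0.$$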
The choice of mergeon for the graphon does not affect consistency, as any two mergeons of the same graphon differ on a set of measure zero. Furthermore, consistency is with respect to the random graph model, and not to any particular graphon representing the model. The following claim, the proof of which is in Appendix B, makes this precise.
Claim. Let $W$ be a graphon and $\varphi$ a measure preserving transformation. A clustering method $f$ is a consistent estimator of the graphon cluster tree of $W$ if and only if it is a consistent estimator of the graphon cluster tree of $W^\varphi$.
Consistency and the blockmodel. If a graph clustering method is consistent in the sense defined above, it is also consistent in the stochastic blockmodel; i.e., it ensures exact recovery of the communities with high probability as the size of the graphs grows large. For instance, suppose $W$ is a stochastic blockmodel graphon with $\alpha$ along the block-diagonal and $\beta$ everywhere else, where $\alpha > \beta$. $W$ has two clusters at level $\alpha$, merging into one cluster at level $\beta$. When the merge distortion between the graphon cluster tree and a clustering $\mathcal{C}$ is less than $\alpha - \beta$, which will eventually be the case with high probability if the method is consistent, the two clusters are totally disjoint in $\mathcal{C}$; this implication is made precise in Section A.4.
5 Consistent algorithms
We now demonstrate that consistent clustering methods exist. We present two results: First, we show that any method which is capable of consistently estimating the probability of each edge in a random graph leads to a consistent clustering method. We then analyze a modification of an existing algorithm to show that it consistently estimates edge probabilities. As a corollary, we identify a graph clustering method which satisfies our notion of consistency. Our results will be for graphons which are piecewise Lipschitz (or weakly isomorphic to a piecewise Lipschitz graphon):
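Definition (Piecewise Lipschitz). We say that $\mathcal{B} = \{B_1, \ldots, B_k\}$ is a block partition if each $B_i$ is an open, half-open, or closed interval in $[0,1]$ with positive measure, $B_i \cap B_j$ is empty whenever $i \neq j$, and $\bigcup \mathcal{B} = [0,1]$. We say that a graphon $W$ is piecewise Lipschitz if there exist a set of blocks $\mathcal{B}$ and a constant $c$ such that for any $(x, y)$ and $(x', y')$ in $B_i \times B_j$, $|W(x, y) - W(x', y')| \leq c \, (|x - x'| + |y - y'|)$.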
Our first result concerns methods which are able to consistently estimate edge probabilities in the following sense. Let $S = (x_1, \ldots, x_n)$ be an ordered set of $n$ uniform random variables drawn from the unit interval. Fix a graphon $W$, and let $\mathbf{P}$ be the random matrix whose $ij$ entry is given by $W(x_i, x_j)$. We say that $\mathbf{P}$ is the random edge probability matrix. Assuming that $W$ has structure, it is possible to estimate $\mathbf{P}$ from a single graph sampled from $W$. We say that an estimator $\hat{\mathbf{P}}$ of $\mathbf{P}$ is consistent in max-norm if, for any $\epsilon > 0$, $\lim_{n \to \infty} \mathbb{P}\left( \max_{i \neq j} | \mathbf{P}_{ij} - \hat{\mathbf{P}}_{ij} | > \epsilon \right) = 0$. The following nontrivial theorem, whose proof comprises Appendix D, states that any estimator which is consistent in this sense leads to a consistent clustering algorithm:
Theorem. Let $W$ be a piecewise Lipschitz graphon. Let $\hat{\mathbf{P}}$ be a consistent estimator of $\mathbf{P}$ in max-norm. Let $f$ be the clustering method which performs single-linkage clustering using $\hat{\mathbf{P}}$ as a similarity matrix. Then $f$ is a consistent estimator of the graphon cluster tree of $W$.
Estimating the matrix of edge probabilities has been a direction of recent research; however, we are only aware of results which show consistency in mean squared error. That is, the literature contains estimators for which $\frac{1}{n^2} \| \mathbf{P} - \hat{\mathbf{P}} \|_F^2$ tends to zero in probability. One practical method is the neighborhood smoothing algorithm of Zhang et al. [2015]. The method constructs for each node $i$ in the graph $G$ a neighborhood $N_i$ of nodes which are similar to $i$ in the sense that for every $i' \in N_i$, the corresponding column $A_{i'}$ of the adjacency matrix is close to $A_i$ in a particular distance. $A_{ij}$ is clearly not a good estimate for the probability of the edge $(i, j)$, as it is either zero or one. However, if the graphon is piecewise Lipschitz, the average of $A_{i'j}$ over $i' \in N_i$ will intuitively tend to the true probability $\mathbf{P}_{ij}$. Like others, the method of Zhang et al. [2015] is proven to be consistent in mean squared error. Since the preceding theorem requires consistency in max-norm, we analyze a slight modification of this algorithm and show that it consistently estimates $\mathbf{P}$ in this stronger sense. The technical details are in Appendix E.
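The idea of averaging over similar nodes can be illustrated with a deliberately simplified stand-in. The sketch below is not the algorithm of Zhang et al. [2015] nor the modification analyzed in Appendix E: it assumes a plain normalized Hamming distance between columns, a fixed neighborhood size `k` that includes the node itself, and a final symmetrization step, all of which are choices made here for illustration.

```python
def smooth_edge_probabilities(adj, k):
    """Estimate edge probabilities by averaging adjacency rows over k similar nodes."""
    n = len(adj)

    def dist(i, l):
        # Assumed distance: fraction of positions where the two columns differ.
        return sum(adj[i][j] != adj[l][j] for j in range(n)) / n

    raw = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Neighborhood of i: the k nodes with the closest columns (i included).
        nbrs = sorted(range(n), key=lambda l: (dist(i, l), l))[:k]
        for j in range(n):
            raw[i][j] = sum(adj[l][j] for l in nbrs) / k
    # Symmetrize, since the target matrix is symmetric.
    return [[(raw[i][j] + raw[j][i]) / 2 for j in range(n)] for i in range(n)]

# Two disjoint cliques on nodes 0-3 and 4-7 (illustrative input).
adj = [[1 if i != j and (i < 4) == (j < 4) else 0 for j in range(8)]
       for i in range(8)]
est = smooth_edge_probabilities(adj, 3)
```

Feeding `est` to a single-linkage routine as a similarity matrix gives the overall shape of the clustering method discussed here, though the consistency guarantees apply only to the estimator analyzed in the paper.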
Theorem 1.
If the graphon $W$ is piecewise Lipschitz, the modified neighborhood smoothing algorithm in Appendix E is a consistent estimator of $\mathbf{P}$ in max-norm.
As a corollary, we identify a practical graph clustering algorithm which is a consistent estimator of the graphon cluster tree. The algorithm is shown in Algorithm 1, and details are in Section E.2. Appendix F contains experiments in which the algorithm is applied to real and synthetic data.
Corollary 1.
If the graphon $W$ is piecewise Lipschitz, Algorithm 1 is a consistent estimator of the graphon cluster tree of $W$.
6 Discussion
We have presented a consistency framework for clustering in the graphon model and demonstrated that a practical clustering algorithm is consistent. We now identify two interesting directions of future research. First, it would be interesting to consider the extension of our framework to sparse random graphs; many real-world networks are sparse, and the graphon generates dense graphs. Recently, however, sparse models which extend the graphon have been proposed; see Caron and Fox [2014], Borgs et al. [2016]. It would be interesting to see what modifications are necessary to apply our framework in these models.
Second, it would be interesting to consider alternative ways of defining the ground truth clustering of a graphon. Our construction is motivated by interpreting the graphon not only as a random graph model, but also as a similarity function, which may not be desirable in certain settings. For example, consider a “bipartite” graphon $W$, which is one on the off-diagonal blocks and zero elsewhere. The cluster tree of $W$ consists of a single cluster at all levels, whereas the ideal bipartite clustering has two clusters. Therefore, consider applying a transformation $f$ to $W$ which maps it to a “similarity” graphon. The goal of clustering then becomes the recovery of the cluster tree of $f(W)$ given a random graph sampled from $W$. For instance, let $f(W) = W^2$, where $W^2$ is the operator square of the bipartite graphon $W$. The cluster tree of $f(W)$ has two clusters at all positive levels, and so represents the desired ground truth. In general, any such transformation leads to a different clustering goal. We speculate that, with minor modification, the framework herein can be used to prove consistency results in a wide range of graph clustering settings.
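The effect of the operator square can be checked on a step-function graphon. The sketch below assumes the bipartite graphon takes the value one between the two halves of the unit interval and zero within each half, and represents it by its matrix of block values `B` together with the block widths `w`. The representation and the name `operator_square` are assumptions made for illustration.

```python
def operator_square(B, w):
    """Block values of the operator square of a step-function graphon.

    B[a][b] is the graphon's value on block a x block b, and w[c] is the
    width of block c, so the integral over z becomes a weighted sum.
    """
    k = len(B)
    return [[sum(B[a][c] * w[c] * B[c][b] for c in range(k)) for b in range(k)]
            for a in range(k)]


# Bipartite graphon: edges only between the two halves of [0, 1].
B = [[0, 1],
     [1, 0]]
w = [0.5, 0.5]
B_squared = operator_square(B, w)
```

Here `B_squared` equals 0.5 within each half and zero across the halves, so each half is a separate cluster at every level in (0, 0.5], which matches the two-cluster ground truth described above.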
Acknowledgements. This work was supported by NSF grants IIS-1550757 & DMS-1547357.
References
 Abbe et al. [2015] Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. IEEE Trans. Inf. Theory, 62(1):471–487, 2015.
 Airoldi et al. [2013] Edoardo M Airoldi, Thiago B Costa, and Stanley H Chan. Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In C J C Burges, L Bottou, M Welling, Z Ghahramani, and K Q Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 692–700. Curran Associates, Inc., 2013.
 Ash and Doleans-Dade [2000] Robert B Ash and Catherine Doleans-Dade. Probability and measure theory. Academic Press, 2000.
 Balakrishnan et al. [2011] Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, and Aarti Singh. Noise thresholds for spectral clustering. In J Shawe-Taylor, R S Zemel, P L Bartlett, F Pereira, and K Q Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 954–962. Curran Associates, Inc., 2011.
 Borgs et al. [2008] C Borgs, J T Chayes, L Lovász, V T Sós, and K Vesztergombi. Convergent sequences of dense graphs I: Subgraph frequencies, metric properties and testing. Adv. Math., 219(6):1801–1851, 20 December 2008.
 Borgs et al. [2016] Christian Borgs, Jennifer T Chayes, Henry Cohn, and Nina Holden. Sparse exchangeable graphs and their limits via graphon processes. arXiv:1601.07134, 26 January 2016.
 Caron and Fox [2014] François Caron and Emily B Fox. Sparse graphs using exchangeable random measures. arXiv:1401.1137, 6 January 2014.
 Chan and Airoldi [2014] Stanley Chan and Edoardo Airoldi. A consistent histogram estimator for exchangeable graph models. In Proceedings of The 31st International Conference on Machine Learning, pages 208–216, 2014.
 Chaudhuri and Dasgupta [2010] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pages 343–351, 2010.
 Eldridge et al. [2015] Justin Eldridge, Mikhail Belkin, and Yusu Wang. Beyond Hartigan consistency: Merge distortion metric for hierarchical clustering. In Proceedings of The 28th Conference on Learning Theory, pages 588–606, 2015.
 Girvan and Newman [2002] M Girvan and M E J Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. U. S. A., 99(12):7821–7826, 11 June 2002.
 Hartigan [1981] J. A. Hartigan. Consistency of Single Linkage for High-Density Clusters. Journal of the American Statistical Association, 76(374):388–394, June 1981. ISSN 0162-1459. doi: 10.1080/01621459.1981.10477658.
 Janson [2008] Svante Janson. Connectedness in graph limits. arXiv:0802.3795, 26 February 2008.
 Kpotufe and Luxburg [2011] Samory Kpotufe and Ulrike V. Luxburg. Pruning nearest neighbor cluster trees. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 225–232, New York, NY, USA, 2011. ACM.
 Lovász [2012] László Lovász. Large networks and graph limits, volume 60. American Mathematical Soc., 2012.
 Lovász and Szegedy [2006] László Lovász and Balázs Szegedy. Limits of dense graph sequences. J. Combin. Theory Ser. B, 96(6):933–957, November 2006.
 McSherry [2001] F McSherry. Spectral partitioning of random graphs. In Foundations of Computer Science, 2001. Proceedings. 42nd IEEE Symposium on, pages 529–537, October 2001.
 Rohe et al. [2011] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Stat., 39(4):1878–1915, August 2011.
 Steinwart [2011] I Steinwart. Adaptive density level set clustering. In Proceedings of The 24th Conference on Learning Theory, pages 703–737, 2011.
 Wolfe and Olhede [2013] Patrick J Wolfe and Sofia C Olhede. Nonparametric graphon estimation. arXiv:1309.5936, 23 September 2013.
 Zhang et al. [2015] Yuan Zhang, Elizaveta Levina, and Ji Zhu. Estimating network edge probabilities by neighborhood smoothing. arXiv:1509.08588, 29 September 2015.
Appendix A Technical details
A.1 Measurable sets modulo null sets
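Let $(\Omega, \Sigma, \mu)$ be a measure space. Let $A, A' \in \Sigma$ be any measurable sets and define $\sim$ to be the relation $A \sim A' \iff \mu(A \,\triangle\, A') = 0$; that is, two measurable sets are equivalent under $\sim$ if they differ by a null set. Write $\Sigma/\!\sim$ for the quotient space of measurable sets by $\sim$, and denote by $[A]$ the equivalence class containing the set $A$. Throughout, we use script letters such as $\mathscr{A}$ to denote these equivalence classes of measurable sets modulo null sets.
We can often use the normal set notation to manipulate such classes without ambiguity. For instance, if $\mathscr{A}$ and $\mathscr{A}'$ are two classes in $\Sigma/\!\sim$, we define $\mathscr{A} \cup \mathscr{A}'$ to be $[A \cup A']$, where $A$ and $A'$ are arbitrary members of $\mathscr{A}$ and $\mathscr{A}'$. $\mathscr{A} \cap \mathscr{A}'$ and $\mathscr{A} \setminus \mathscr{A}'$ are defined similarly. We can define $\mathscr{A} \times \mathscr{A}'$ in this manner too; note that the result is an equivalence class in $(\Sigma \otimes \Sigma)/\!\sim$, where the relation $\sim$ is implicitly assumed to be with respect to the product measure $\mu \times \mu$. Similarly, we can unambiguously order such equivalence classes. For example, we write $\mathscr{A} \subset \mathscr{A}'$ to denote $\mu(A \setminus A') = 0$.
In some instances it will be more convenient to work with sets as opposed to equivalence classes of sets. In such cases we will use a section map $\rho : \Sigma/\!\sim\; \to \Sigma$ which returns an (often arbitrary) member of the class, $\rho(\mathscr{A}) \in \mathscr{A}$.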
A.2 A precise definition of a mergeon
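In Definition 3, we introduce the mergeon of a cluster tree $\mathbb{C}$ as a graphon $M$ satisfying
\[ M^{-1}[\lambda, 1] = \bigcup_{\mathscr{C} \in \mathbb{C}(\lambda)} \mathscr{C} \times \mathscr{C} \]
for all $\lambda \in [0,1]$. This definition involves a slight abuse of notation. In particular, each $\mathscr{C}$ is an equivalence class of sets modulo null sets. Therefore, as described in the previous subsection, $\bigcup_{\mathscr{C} \in \mathbb{C}(\lambda)} \mathscr{C} \times \mathscr{C}$ is defined to be an equivalence class of measurable subsets of $[0,1]^2$ modulo null sets. On the other hand, $M^{-1}[\lambda, 1]$ is simply a measurable subset of $[0,1]^2$. Therefore, the left and the right of the above equation are two different types of objects, and we are being imprecise in equating them.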
A precise statement of this definition is as follows:
Definition 7 (Mergeon, rigorous).
A mergeon of is a graphon such that for all ,
is a null set, where , is the symmetric difference operator, and is an arbitrary section map. Equivalently, a mergeon of is a graphon such that for all ,
where is the equivalence class of measurable subsets of modulo null sets which contains .
In these more rigorous definitions we compare sets to sets, or equivalence classes to equivalence classes, and thus precisely define the mergeon.
A.3 Strict cluster trees and their mergeons
A graphon cluster tree is a hierarchical collection of equivalence classes of sets. It is sometimes useful instead to work with a hierarchical collection of subsets of $[0,1]$. We may always do so by choosing a section map and applying it to every cluster in the cluster tree. Though the choice of representative of a given cluster is often arbitrary, it will sometimes be useful to choose it in such a way that the cluster tree has strictly nested structure, as made precise in the following definition.
Definition 8 (Strict section).
Let be a cluster tree. A strict section is a function which selects a unique representative from each cluster such that if:

,

, and

(Technical condition) .
The strict cluster tree induced by applying to is defined by
Claim 9 in Appendix B proves that it is always possible to construct a strict section. Furthermore, given a cluster tree and a strict section, there is a unique mergeon representing the strict cluster tree, defined as follows:
Definition 9 (Strict mergeon).
Let be a cluster tree, and suppose is a strict section for the clusters of . Then is a strict mergeon of the strict cluster tree induced by if, for every ,
Because any two mergeons of the same cluster tree differ only on a null set, we are typically free to assume that a mergeon is strict without much loss of generality. Making this assumption will simplify some statements and proofs.
A.4 Merge distortion and cluster structure
Definition 4 introduces the merge height induced on a clustering by a mergeon. There are, of course, other approaches to inducing a merge height on a clustering, but our definition allows for a particular interpretation of the merge distortion in terms of the cluster structure that is recovered by the finite clustering, as the following claim makes precise. It is convenient to state the claim using the notion of a strict mergeon defined in Section A.3; analogous (and equally strong) claims can be made for general mergeons and cluster trees.
Claim.
Let be a cluster tree, and let be a strict cluster tree obtained by applying a strict section to each cluster of . Let be the strict mergeon representing . Let with each and suppose is a clustering of . Let be the induced merge height on . If , we then have:

Connectedness: If is a cluster of at level and , then the smallest cluster in which contains all of is contained within , where is the cluster of at level which contains .

Separation: If and are two clusters of at level such that and merge at level , then if , the smallest cluster in containing and the smallest cluster containing are disjoint.
The proof of this claim is found in Appendix B.
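The quantities in the claim can be made concrete on a finite sample. The sketch below assumes that the merge height induced on a pair of sample points is the smallest mergeon value over the smallest cluster containing both points, and that the merge distortion is the largest absolute gap between the mergeon and this induced height. Both readings, the matrix `M` of mergeon values on the sample, and all function names are assumptions made for illustration.

```python
def induced_merge_height(M, clusters, i, j):
    """Smallest value of M over the smallest cluster containing both i and j.

    clusters is a list of sets of sample indices and is assumed to contain
    a root cluster holding every index, so a containing cluster always exists.
    """
    smallest = min((c for c in clusters if i in c and j in c), key=len)
    return min(M[u][v] for u in smallest for v in smallest if u != v)


def merge_distortion(M, clusters):
    """Largest gap between the mergeon and the induced merge height."""
    n = len(M)
    return max(abs(M[i][j] - induced_merge_height(M, clusters, i, j))
               for i in range(n) for j in range(i + 1, n))


# Mergeon values on four sample points: {0, 1} merge at 0.8, {2, 3} at 0.6,
# and the two groups merge at 0.2.
M = [
    [1.0, 0.8, 0.2, 0.2],
    [0.8, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.6],
    [0.2, 0.2, 0.6, 1.0],
]
good = merge_distortion(M, [{0, 1}, {2, 3}, {0, 1, 2, 3}])  # recovers both groups
flat = merge_distortion(M, [{0, 1, 2, 3}])                  # lumps everything together
```

The clustering that recovers both groups has zero distortion, while the flat clustering is off by 0.6 on the pair (0, 1). A small distortion therefore forces the clustering to keep true clusters together and apart, which is what the claim asserts.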
Appendix B Proofs
In the following proofs we will often work with the clusters of a graphon. Recall that, formally, a cluster is not a subset of $[0,1]$, but rather an equivalence class of measurable subsets of $[0,1]$ which differ by null sets. Nevertheless, we will often speak as though the clusters are in fact subsets of $[0,1]$; we can typically do so without issue. For instance, we might say “$C$ is a cluster of a graphon $W$ at level $\lambda$.” Technically speaking, this is incorrect. However, we might interpret the above statement as saying that there exists a cluster $\mathscr{C}$ at level $\lambda$ such that $C$ is a representative of $\mathscr{C}$.
*
Proof.
Let be an arbitrary representative of the cluster . By definition of the mergeon, all but a null set of is contained within , and therefore almost everywhere on . This implies that is connected at level in , which in turn implies that is contained in some cluster of at level . By definition, is connected in at every level , and so Claim 7 in Appendix B implies that is contained in some cluster of at every level . Claim 8 in Appendix B then implies that is contained in some cluster of at level . In other words, is a cluster of at level , and , so the fact that is contained in a cluster of at level implies that differs from by at most a null set. Hence is a cluster of .
Now suppose is a cluster of at level . Then is connected in at every level , and so Claim 7 implies that is contained in some cluster of at every level . Claim 8 then implies that is contained in some cluster of at level . Let be this cluster. Then the above implies that is a cluster at level in . But , and is a cluster of , so it must be that and differ by a null set, and hence is a cluster of .
∎
*
Proof.
On one hand, the function defined by
is a mergeon of , since is a cluster of if and only if is a cluster of by Section 3 in the body of the paper. Now consider the pullback and its upper level set
which, by definition of the mergeon, is  
It is well-known that if is a measure preserving map, then the transformation defined by is also measure preserving and measurable. Therefore we have
Since preimages commute with arbitrary unions:  
Some thought will show that , such that:  
Comparing this to the definition of above, which was a mergeon of , we see that is a mergeon of .
∎
*
Proof.
Let be a mergeon of the cluster tree of and fix any . Consider the set
which is the set of graph/sample pairs for which the merge distortion between the clustering and the mergeon is greater than . Consistency with respect to the cluster tree of requires that as . Now recall that is a mergeon of the cluster tree of , and consider
where is the merge height induced on the clustering by the mergeon . is the set of graph/sample pairs for which the merge distortion between the clustering and the mergeon is greater than . Consistency with respect to the cluster tree of requires that as . It will therefore be sufficient to show that to prove the claim.
Now we compute the measure under of :
where denotes the section of by graph , that is, the set . It is easy to see that , such that:
Since , we have
such that
Now consider the section of by , defined by . It is clear that for every graph . Therefore,
Now, it is a property of measure preserving maps that ; see, for example, Ash and Doleans-Dade [2000]. Therefore, we have
which proves the claim.
∎
*
Proof.
First we prove connectedness. Let be the smallest cluster in containing . Suppose contains a point from outside of . Let be any two distinct points in . Then necessarily , as, since is strict, if and only if are in the same cluster of at level . Hence the merge distortion is at least , which is a contradiction.
Separation follows from connectedness. Let be the smallest cluster in the clustering containing , and similarly for . Let and be the clusters at level which contain and . Then , since and merge below . Furthermore, by connectedness, and . Hence they are disjoint. ∎
Claim 1.
Let be a measure space, with a finite measure (i.e., ). Let be closed under countable unions. Define the set of essential maxima of by
Then if is nonempty, is nonempty. Furthermore, for any , .
Proof.
The claim holds trivially if is empty, so suppose it is not. Let , and note that is finite since is finite. Then for every , there exists a set such that . Construct the sequence by defining Then for every , since it is the countable union of sets in . Furthermore, , and . Define . Then since it is a countable union of elements of , and by continuity of measure .
First we show that for any set , . Suppose for a contradiction that . We have , such that . But is in , since is closed under unions. This violates the fact that is the supremal measure of any set in , and hence it must be that . Therefore .
Now suppose is an arbitrary element of . We have just seen that must be zero, since . Likewise, . Therefore
where we used the fact that is an additive set function and and disjoint. It is also clear that if and is any null set, then and are also essential maxima. ∎
Claim 2.
Let be a measure space, with a finite measure (i.e., ). Let be closed under countable intersections. Define the set of essential minima of by
Then if is nonempty, is nonempty. Furthermore, for any , .
Proof.
The proof is analogous to that of Claim 1 for essential maxima; we simply switch from a supremum to an infimum and construct a descending sequence. It is therefore omitted.
∎
Claim 3.
Let be a graphon, and suppose and are measurable sets with positive measure, and that each is connected at level in . If , then is connected at level .
Proof.
Suppose and that is disconnected at level . Then, by definition, there exists a measurable set such that and almost everywhere on .
It is either the case that or , as otherwise we would have . If