Averages of Unlabeled Networks: Geometric Characterization and Asymptotic Behavior
It is becoming increasingly common to see large collections of network data objects – that is, data sets in which a network is viewed as a fundamental unit of observation. As a result, there is a pressing need to develop network-based analogues of even many of the most basic tools already standard for scalar and vector data. In this paper, our focus is on averages of unlabeled, undirected networks with edge weights. Specifically, we (i) characterize a certain notion of the space of all such networks, (ii) describe key topological and geometric properties of this space relevant to doing probability and statistics thereupon, and (iii) use these properties to establish the asymptotic behavior of a generalized notion of an empirical mean under sampling from a distribution supported on this space. Our results rely on a combination of tools from geometry, probability theory, and statistical shape analysis. In particular, the lack of vertex labeling necessitates working with a quotient space modding out permutations of labels. This results in a nontrivial geometry for the space of unlabeled networks, which in turn is found to have important implications on the types of probabilistic and statistical results that may be obtained and the techniques needed to obtain them.
MSC subject classifications: Primary 62E20, 62G20; secondary 53C20.
Keywords: fundamental domain, Fréchet mean, graphs.
Over the past 15-20 years, as the field of network science has exploded with activity, the majority of attention has been focused upon the analysis of (usually large) individual networks. See [21, 23, 29], for example. While it is unlikely that the analysis of individual networks will become any less important in the near future, it is likely that in the context of the modern era of ‘big data’ there will soon be an equal need for the analysis of (possibly large) collections of (sub)networks, i.e., collections of network data objects.
We are already seeing evidence of this emerging trend. For example, the analysis of massive online social networks like Facebook can be facilitated by local analyses, such as through extraction of ego-networks (e.g., ). Similarly, the Functional Connectomes Project, launched a few years ago in imitation of the data-sharing model long common in computational biology, makes available a large number of fMRI functional connectivity networks for use and study in the context of computational neuroscience (e.g., ). It would seem, therefore, that in the near future networks of small to moderate size will themselves become standard, high-level data objects.
Faced with databases in which networks are the fundamental unit of data, it will be necessary to have in place a network-based analogue of the ‘Statistics 101’ tool box, extending standard tools for scalar and vector data to network data objects. The extension of such classical tools to network-based datasets, however, is not immediate, since networks are not inherently Euclidean objects. Rather, formally they are combinatorial objects, defined simply through two sets, of vertices and edges, respectively, possibly with an additional corresponding set of weights. Nevertheless, initial work in this area demonstrates that networks can be associated with certain natural Euclidean subspaces and furthermore demonstrates that through a combination of tools from geometry, probability theory, and statistical shape analysis it should be possible to develop a comprehensive, mathematically rigorous, and computationally feasible framework for producing the desired analogues of classical tools.
For example, in our recent work  we have characterized the geometry of the space of all labeled, undirected networks with edge weights, i.e., consisting of graphs , for weights , where equality with zero holds if and only if . This characterization allowed us in turn to establish a central limit theorem for an appropriate notion of a network empirical mean, as well as analogues of classical one- and two-sample hypothesis testing procedures. Other results of this type include additional work on asymptotics for network empirical means  and regression modeling with a network response variable, where for the latter there have been both frequentist  and Bayesian  proposals. Work in this area continues at a quick pace – see, for example,  which proposes a classification model based on network-valued inputs and  which proposes a nonparametric Bayes model for distributions on populations of networks. Earlier efforts in this space have focused on the specific case of trees. Contributions of this nature include work on central limit theorems in the space of phylogenetic trees [10, 4] and work by Marron and colleagues [32, 3] in the context of so-called object-oriented data analysis with trees.
To the best of our knowledge, all such work to date pertains to the case of labeled networks: that is, to networks in which the vertices have unique labels, e.g., . In fact, unlabeled networks have received decidedly less attention in the network science literature as a whole but nevertheless arise in various important settings. The quintessential example of how such networks may arise in practice arguably is that of the study of ego-centric network structure in social network analysis. There, traditionally, individuals (‘egos’) are surveyed for a list of other individuals (‘alters’) with whom they share a certain relationship (e.g., friendship, colleague, etc.) and only common patterns across networks in the structure of the relationships among the individuals within each network are of interest. This leads to analyses that either ignore vertex labels or for which vertex labels are simply not available (e.g., through de-identification). See , for example.
In this paper, our focus is on averages of unlabeled, undirected networks with edge weights. Adopting a perspective similar to that in our previous work , we (i) characterize a certain notion of the space of all such networks, (ii) describe key topological and geometric properties of this space relevant to doing probability and statistics thereupon, and (iii) use these properties to establish the asymptotic behavior of a generalized notion of an empirical mean under sampling from a distribution supported on this space. In particular, adopting the notion of a Fréchet mean, we establish a corresponding strong law of large numbers and a central limit theorem. In contrast to , where the corresponding space of networks was found to form a smooth manifold, here the lack of vertex labeling necessitates working with a quotient space modding out permutations of labels. As a result, we have only an orbifold – a more general geometric structure – which in turn is found to have important implications on the types of probabilistic and statistical results that may be obtained and the techniques needed to obtain them.
The nature of our work is in the spirit of statistics on manifolds and statistical shape analysis, which employs the geometry of manifolds or shape spaces for defining Fréchet means and developing large sample theory of their sample counterparts for inference. See  for a rather comprehensive treatment on the subject. Our approach to studying the entire family of networks subject to an equivalence relation under a group action, via forming the associated quotient or moduli space, is a common theme in modern geometry, including gauge theory , symplectic topology , and algebraic geometry [14, 31]. The appearance of orbifolds, often much more complicated than in our case, is quite common. Finally, there is a large literature on graph limits, for which substantial work has been done on analysis of appropriate spaces of networks (e.g., ). But the focus therein typically is, by definition, on the case of a single network asymptotically increasing in size. Here, the focus is on asymptotics in many networks, with the dimension fixed.
The organization of this paper is as follows. In Section 2 we present our characterization of the space of unlabeled networks. Results from our investigation of the asymptotic behavior of the Fréchet empirical mean are then provided in Section 3. While a strong law of large numbers is found to emerge under quite general conditions, establishing just when conditions dictated by the current state of the art for central limit theorems on manifolds hold turns out to be a decidedly more subtle exercise. This latter is the focus of Section 4. Some additional discussion of open problems may be found in Section 5. The Appendices discuss implementation issues for the theoretical results in the paper.
2 The space of unlabeled networks
Our ultimate focus in this paper is on a certain well-defined notion of an ‘average’ on elements drawn randomly from a ‘space’ of unlabeled networks and on the statistical behavior of such averages. Accordingly, we need to establish and understand the relevant topology and geometry of this space. We do so by associating labeled networks with vectors and mapping those to unlabeled networks through the use of equivalence classes in an appropriate quotient space. In this section we provide relevant definitions, characterization, and illustrations of this space of unlabeled networks.
2.1 The topological space of unlabeled networks
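Let $G = (V, E)$ be a labeled, undirected graph/network with weighted edges and with $d$ vertices/nodes. We always think of $E$ as having $D = d(d-1)/2$ elements, where some of the edge weights can be zero. We think of the edge weight $w(i,j)$ between vertices $i$ and $j$ as the strength of some unspecified relationship between $i$ and $j$.
Let $\Sigma_d$ be the group of permutations of $\{1, \ldots, d\}$. A permutation $\sigma \in \Sigma_d$ of the vertex labels technically produces a new graph $\sigma G$, but with no new information. To define $\sigma G$ precisely, note that the weight function can be thought of as a symmetric function $w : V \times V \to \mathbb{R}_{\geq 0}$, with $w(i,j)$ the weight of the edge joining vertex $i$ and vertex $j$ in $G$. Therefore the action of $\sigma$ on $w$ is given by
\[ (\sigma \cdot w)(i, j) = w(\sigma^{-1}(i), \sigma^{-1}(j)). \]
(The inverse guarantees that $(\sigma\tau) \cdot w = \sigma \cdot (\tau \cdot w)$.) Note that for general $d$, not all permutations of the entries of $w$ are of the form $\sigma \cdot w$, as $w$ may have $D!$ distinct permutations and $\Sigma_d$ has $d!$ elements.
In summary, $\sigma G$ is defined to be the graph on $d$ vertices with weight function $\sigma \cdot w$. Let $\mathcal{G}_d$ be the set of all labeled graphs with $d$ vertices. Then the quotient space
\[ \mathcal{U}_d = \mathcal{G}_d / \Sigma_d \]
is the space of unlabeled graphs, the object we want to study. This means that an unlabeled network is an equivalence class
\[ [G] = \{\sigma G : \sigma \in \Sigma_d\}. \]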
As we now explain, looks like an explicit subset of , and so is easy to picture. In contrast, the quotient space is difficult to picture. Nevertheless, as we describe in the following paragraphs, the topology of may be characterized through standard point-set topology techniques, with the conclusion that everything works as well as possible. Readers who wish can safely skip to the examples in Section 2.2.
Fix an ordering of the vertices , and take the lexicographic ordering on the set of edges. (Thus if or and ) Given this ordering, we get an injection
where is the weight of the edge of . The image of is the first “octant” The Euclidean metric on pulls back via to the obvious metric on : two networks are close iff their edge weights are close. Similarly, the standard topology on (an open ball in is the intersection of an open -Euclidean ball with ) pulls back to a topology on (This just means that is open iff is open in This makes a homeomorphism.) Just as in , the metric and topology are compatible: a sequence of graphs/weight vectors in converges to a graph/weight vector in the topology of iff the distance from to goes to zero.
Via the bijection , the action of on transfers to an action on First, acts on by if corresponds to the edge and corresponds to the edge . Then acts on by Since we’ve arranged the actions to be compatible with , we get a well defined bijection :
From now on, we just denote by
To complete the topological discussion, we note that is a homeomorphism if we give both sides the quotient topology: for the map taking a graph to its equivalence class, a set is open iff is open in . The quotient topology on is defined similarly.
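The vectorization and the induced action on weight vectors described above can be sketched in a few lines. This is a minimal illustration rather than code from the paper: the helper names `edge_list`, `act`, and `orbit` are hypothetical, vertices are indexed from 0 instead of 1, and a permutation is assumed to be stored as a tuple `sigma` with `sigma[i]` the new label of vertex `i`.

```python
from itertools import combinations, permutations

def edge_list(d):
    """Edges (i, j) with i < j in lexicographic order; there are d(d-1)/2."""
    return list(combinations(range(d), 2))

def act(sigma, w, d):
    """Apply the vertex relabeling sigma to the weight vector w.

    The weight of edge {i, j} is moved to the position of edge
    {sigma[i], sigma[j]}, i.e. (sigma.w)(a, b) = w(sigma^-1(a), sigma^-1(b)).
    """
    edges = edge_list(d)
    index = {e: k for k, e in enumerate(edges)}
    out = [0] * len(edges)
    for k, (i, j) in enumerate(edges):
        a, b = sorted((sigma[i], sigma[j]))
        out[index[(a, b)]] = w[k]
    return tuple(out)

def orbit(w, d):
    """All weight vectors obtained from w by relabeling the d vertices."""
    return {act(s, w, d) for s in permutations(range(d))}

# A weight vector with six different entries on d = 4 vertices.
print(len(orbit((1, 2, 3, 4, 5, 6), 4)))
```

For `d = 4` the orbit above has 24 = 4! elements, far fewer than the 720 = 6! rearrangements of the six entries, which is the point of the remark that not every permutation of the entries comes from a relabeling of vertices.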
2.2 Examples of quotient spaces
As a warmup, we first give a simple example of a quotient space resulting from the action of a finite group on a Euclidean space. This particular example is important in providing a relevant non-network analogy to our network-based results. We will revisit it frequently throughout the paper.
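The group $\mathbb{Z}_4 = \{0, 1, 2, 3\}$ acts on the plane $\mathbb{R}^2$ by rotation counterclockwise by $90$ degrees: specifically, for $k \in \mathbb{Z}_4$ and $(x, y) \in \mathbb{R}^2$,
\[ k \cdot (x, y) = \big(x \cos(k\pi/2) - y \sin(k\pi/2),\; x \sin(k\pi/2) + y \cos(k\pi/2)\big). \]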
Thus etc. A point in the quotient space is the set The set is called the orbit of under Note that every orbit is a four element set except for the exceptional orbit
The closed first quadrant is a fundamental domain for this action; i.e., each orbit has a unique representative/element in , except possibly for the orbits of points on the boundary of . Orbits of boundary points like have two representatives in , while the origin of course has only one representative.
If we want to picture a set that is bijective to , we could take e.g. to be minus the positive -axis. This is not so helpful topologically or geometrically, as the points and have close representatives in , while their representatives and are not close in . In particular, the sequence does not converge in , but the orbits converge to in Thus does not give us a good picture of topologically.
In summary, it is much better to keep both positive axes in , and to consider as (in bijection with) with the boundary points and “glued together.” More precisely, we have a bijection
where the denominator indicates that the two point set is one point of , while all other points of correspond to a single point in At the price of this gluing, we now have that is a homeomorphism: points are close in iff they are close in (Technical remarks: gets the quotient topology from the standard topology on and the obvious surjection and “close” refers to the Procrustean distance defined in Section 3.)
Although this seems a little involved, it is quite easy to perform the gluing in in rubber sheet topology: stretching the interior of to allow the gluing of the two axes shows that and hence is a cone. See Figure 1.
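The gluing in this example can be checked numerically. The sketch below is an illustration under our own naming (`rotate`, `quotient_distance`, and `representative` are hypothetical helpers), with the distance between two orbits taken to be the minimum Euclidean distance over the four rotations, in the spirit of the Procrustean distance of Section 3.

```python
import math

def rotate(k, p):
    """Rotate p = (x, y) counterclockwise by k * 90 degrees (exact, no trig)."""
    x, y = p
    for _ in range(k % 4):
        x, y = -y, x
    return (x, y)

def quotient_distance(p, q):
    """Distance between the orbits of p and q: minimum over the four rotations."""
    return min(math.dist(rotate(k, p), q) for k in range(4))

def representative(p):
    """A representative of the orbit of p in the closed first quadrant."""
    for k in range(4):
        x, y = rotate(k, p)
        if x >= 0 and y >= 0:
            return (x, y)

eps = 0.01
p, q = (1.0, eps), (1.0, -eps)      # two nearby points of the plane
rp, rq = representative(p), representative(q)
print(rp, rq)                        # representatives in the first quadrant
print(math.dist(rp, rq))             # far apart as points of the quadrant
print(quotient_distance(rp, rq))     # close as points of the quotient
```

The two representatives sit near different axes of the quadrant, so their Euclidean distance is large, while their orbits are at distance about `2 * eps`. This is the numerical counterpart of gluing the two boundary rays together.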
We work out in detail the case of a network with three vertices. This is a deceptively easy case, as implies that every permutation of the edge weights comes from a permutation in In higher dimensions, the details are more complicated.
We first describe the quotient space of unlabeled graphs directly, and then find a fundamental domain for inside the space of labeled graphs . The direct approach is more difficult, which motivates our concentrating on fundamental domains in the rest of the paper.
Note that acts freely on i.e., iff The subset
is an open 3-manifold which is dense in and with a free -action, so we get a smooth 3-manifold structure on the open dense set
Now let be the subset of consisting of graphs with ; i.e., the weight of the edge (which is lexicographically the first edge) from vertex to vertex is zero. Then is the subset of the -plane (i.e. ) given by The subgroup of permutations fixing edge fixes . acts freely on ( of) the set of graphs and fixes the diagonal line This is because Thus the quotient space is homeomorphic to a closed pie wedge. We think of as a stratified -manifold: it contains an open, dense set which is a -manifold, the two edges (minus the origin) which are -manifolds, and the origin as a -manifold.
Denote the equivalence class of a point in by Each equivalence class in contains six points. In general, the number of elements in an equivalence class equals , where is the stabilizer subgroup of
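The count of elements in an equivalence class can be verified by brute force in the three-vertex case. The sketch below relies on the fact noted above that, for three vertices, every permutation of the three edge weights comes from a permutation of the vertices, so the group can be modeled as all rearrangements of the entries. The function name `orbit_and_stabilizer` is ours.

```python
from itertools import permutations

def orbit_and_stabilizer(w):
    """Orbit of a 3-entry weight vector under all rearrangements of its
    entries, together with the rearrangements that leave it unchanged."""
    w = tuple(w)
    perms = list(permutations(range(len(w))))
    images = [tuple(w[i] for i in s) for s in perms]
    orbit = set(images)
    stabilizer = [s for s, img in zip(perms, images) if img == w]
    return orbit, stabilizer

for w in [(1, 2, 3), (1, 1, 2), (1, 1, 1)]:
    orbit, stab = orbit_and_stabilizer(w)
    # Orbit size times stabilizer size equals the group order, here 6.
    print(w, len(orbit), len(stab))
```

Weight vectors with three different entries have six-element classes, vectors with exactly two equal entries have three-element classes, and vectors with all entries equal are fixed by the whole group.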
The three coordinate planes , etc. get glued under the action of . For example, and Each coordinate plane gets further glued, e.g., gets glued to . Thus the three coordinate planes get glued to one pie wedge. This pie wedge is glued onto as follows: given with , we declare the limit point of this sequence to be We make the similar definition if or This is clearly well defined. We make a similar definition if has , etc. and if has .
From now on, for expository reasons, we drop the automatic conditions from the description of subsets of
Similarly, the three planes , get glued together (e.g., ). Note that e.g. if and For example is glued to These three planes intersect at the line . Thinking of the three planes as troughs with edge , the three troughs are glued together. The two sides of a trough are not glued to each other, but are glued to sides of two other troughs. As above, has limit point/is glued to if , etc. In particular, if , this is consistent with the previous gluing.
The final quotient is a stratified -manifold:
The dense -dimensional piece is , which is topologically a -ball.
With increasingly terse notation, the 2d strata are
The 1d strata are
The 0d stratum is
The point in the 1d stratum can be perturbed into a 2d stratum point or or into a 3d stratum point This agrees with the fact that the 1d stratum glues both to a trough (a 2d stratum) and to an open wedge in a coordinate plane, and that this 1d stratum also glues to the big cell.
It is easier to picture the quotient space of unlabeled networks by finding a fundamental domain inside for the action of As in the previous example, and detailed in Section 4, is a closed set such that the quotient map is a continuous surjection, a homeomorphism from the interior of to a dense set of unlabeled networks, and a finite-to-one map on the boundary of . Thus represents bijectively except for some gluings on the boundary. This is illustrated in Figure 2, where Again, the case is deceptively easy, as is a bijection even on
The situation is more complicated for graphs with 4 (or more) vertices. For , if we label the edges as , then the weight vectors and have the same distributions of ones and zeros, but correspond to binary graphs which are not in the same orbit of . In particular, the region is not a fundamental domain for the action of
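The phenomenon can be reproduced by enumerating orbits of binary weight vectors on four vertices. The three vectors below (a star, a triangle with an isolated vertex, and a path) are our own illustrative choices and are not necessarily the pair used in the text; the helpers `act` and `orbit` are hypothetical names, with vertices indexed from 0.

```python
from itertools import combinations, permutations

def act(sigma, w, d):
    """Move the weight of edge {i, j} to the position of {sigma[i], sigma[j]}."""
    edges = list(combinations(range(d), 2))
    index = {e: k for k, e in enumerate(edges)}
    out = [0] * len(edges)
    for k, (i, j) in enumerate(edges):
        a, b = sorted((sigma[i], sigma[j]))
        out[index[(a, b)]] = w[k]
    return tuple(out)

def orbit(w, d):
    return {act(s, w, d) for s in permutations(range(d))}

# Edge order for d = 4: (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
star     = (1, 1, 1, 0, 0, 0)   # edges 01, 02, 03
triangle = (1, 1, 0, 1, 0, 0)   # edges 01, 02, 12
path     = (1, 0, 0, 1, 0, 1)   # edges 01, 12, 23

for name, w in [("star", star), ("triangle", triangle), ("path", path)]:
    print(name, len(orbit(w, 4)))
print(triangle in orbit(star, 4))
```

All three vectors have three ones and three zeros, yet they lie in three different orbits. Sorting the entries of a weight vector therefore identifies networks that are not relabelings of one another once there are four or more vertices.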
While a fundamental domain is harder to find in high dimensions (see Section 4), the overall structure of for general is similar to the case, with just increased notation.
The space of unlabeled graphs is a stratified space.
We just sketch the proof, since this result is not used below. We don’t need the technical definition of a stratified space, just a general understanding that consists of a sequence of -dimensional manifolds with boundary, with -dimensional strata glued coherently to (or higher) dimensional strata. The big open cell of dimension is Lower strata are characterized by the number of zero entries and the number of equal nonzero entries. More precisely, say the weight vector has zeros and entries equal to , entries equal to , with all distinct. Then belongs to a stratum of dimension
A higher dimensional stratum has a lower dimensional stratum as part of if a sequence of points in converges to a point in This can occur either by an entry in this sequence going to zero, or by entries in this sequence going to a common positive limit. ∎
By , is PL or Lipschitz homeomorphic to , but the proof does not give a cell decomposition of , much less of
3 Network averages and their asymptotic behavior
In this section we define the mean of a distribution on the space of networks and investigate the asymptotic behavior of the empirical (or sample) mean network based on an i.i.d. sample of networks from . Statistical inference can be carried out based on the asymptotic distribution of the empirical mean. We illustrate with an example from hypothesis testing. The results of the previous section, characterizing the topology and geometry of the space of unlabeled networks, are essential for achieving our goals in this section.
3.1 Network averages through Fréchet means
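Let $Q$ be some distribution on a general metric space $(M, \rho)$. One can define the Fréchet function on $M$ as
\[ F(p) = \int_M \rho^2(p, z) \, Q(dz), \qquad p \in M. \tag{3.1} \]
If $F$ is finite on $M$ and has a unique minimizer
\[ \mu = \arg\min_{p \in M} F(p), \tag{3.2} \]
then $\mu$ is called the Fréchet mean of $Q$ (with respect to the metric $\rho$). Otherwise, the minimizers of the Fréchet function form a Fréchet mean set $C_Q$. Given an i.i.d. sample $X_1, \ldots, X_n \sim Q$ on $M$, the empirical Fréchet mean can be defined by replacing $Q$ with the empirical distribution $Q_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$, that is,
\[ \hat{\mu}_n = \arg\min_{p \in M} \frac{1}{n} \sum_{i=1}^n \rho^2(p, X_i). \tag{3.3} \]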
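As a sanity check on these definitions, it may help to work out the Euclidean case, where the Fréchet mean reduces to the ordinary expectation. The computation below is a standard one and is our addition; it assumes a random vector $X$ in $\mathbb{R}^D$ with finite second moments, writes $m = \mathbb{E}X$, and takes the metric to be the Euclidean distance.

```latex
\begin{align*}
F(p) &= \mathbb{E}\,\|p - X\|^2 \\
     &= \|p - m\|^2 + 2\,\langle p - m,\; \mathbb{E}(m - X)\rangle + \mathbb{E}\,\|m - X\|^2 \\
     &= \|p - m\|^2 + \operatorname{tr}\operatorname{Cov}(X).
\end{align*}
```

The second term does not depend on $p$, so the unique minimizer is $p = m$. Applying the same computation to the empirical distribution shows that the empirical Fréchet mean is the sample average. Non-uniqueness can therefore only arise from the geometry of the underlying space, which is the issue taken up for unlabeled networks below.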
When is a manifold, one can equip with a metric space structure through an embedding into some Euclidean space or employing a Riemannian structure of . Respectively, can be taken to be the Euclidean distance after embedding (extrinsic distance) or the geodesic distance (intrinsic distance), giving rise to extrinsic and intrinsic means. Asymptotic theory for extrinsic and intrinsic analysis has been developed in ,, , and applied to many manifolds of interest (see e.g., , ).
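Now take $M = \mathcal{U}_d$, the space of unlabeled networks with $d$ nodes, our space of interest, and let $Q$ be a distribution on $\mathcal{U}_d$. Given an i.i.d. sample from $Q$, in order to define the Fréchet mean of $Q$ and empirical Fréchet mean of $Q_n$, one needs an appropriate choice of distance on $\mathcal{U}_d$. Given the quotient space structure characterized in the previous section, i.e., $\mathcal{U}_d = \mathcal{G}_d / \Sigma_d$, a natural choice for the distance is the Procrustean distance $d_P$, where
\[ d_P([G], [G']) = \min_{\sigma \in \Sigma_d} \|\sigma \cdot \vec{w} - \vec{w}'\| \tag{3.4} \]
for unlabeled networks $[G], [G'] \in \mathcal{U}_d$, with $\vec{w}$ denoting the vectorized representation of a representative network $G$ (and similarly $\vec{w}'$ for $G'$). We recall that $\mathcal{G}_d$ is the set of all labeled graphs with $d$ vertices and $\Sigma_d$ is the group of permutations of $\{1, \ldots, d\}$.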
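For small networks the Procrustean distance can be computed by brute force over all vertex relabelings. The following sketch is illustrative only: `act` and `procrustean_distance` are hypothetical helper names, vertices are indexed from 0, weight vectors follow the lexicographic edge order, and the cost grows like the factorial of the number of vertices, so this is not a practical algorithm beyond a handful of nodes.

```python
import math
from itertools import combinations, permutations

def act(sigma, w, d):
    """Move the weight of edge {i, j} to the position of {sigma[i], sigma[j]}."""
    edges = list(combinations(range(d), 2))
    index = {e: k for k, e in enumerate(edges)}
    out = [0] * len(edges)
    for k, (i, j) in enumerate(edges):
        a, b = sorted((sigma[i], sigma[j]))
        out[index[(a, b)]] = w[k]
    return tuple(out)

def procrustean_distance(w1, w2, d):
    """Minimum Euclidean distance between w2 and any relabeling of w1."""
    return min(math.dist(act(s, w1, d), w2) for s in permutations(range(d)))

# Edge order for d = 4: (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
star_at_0 = (1, 1, 1, 0, 0, 0)
star_at_3 = (0, 0, 1, 0, 1, 1)
triangle  = (1, 1, 0, 1, 0, 0)

print(procrustean_distance(star_at_0, star_at_3, 4))  # same unlabeled network
print(procrustean_distance(star_at_0, triangle, 4))   # different unlabeled networks
```

Because each relabeling acts as an isometry of the weight space, minimizing over relabelings of one argument gives the same value as minimizing over relabelings of both.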
In order to carry out statistical inference based on , defined with respect to the distance (3.4), some natural and fundamental questions related to and need to be addressed, which we aim to do in the following subsections. Here are some of the most crucial ones:
(Consistency.) What are the consistency properties of the network empirical mean , i.e., is a consistent estimator of the population Fréchet mean ? Can we establish some notion of a law of large numbers for ?
(Uniqueness of Fréchet mean.) This question is concerned with establishing general conditions on for uniqueness of the Fréchet mean . In general this is a challenging task – indeed, the lack of general uniqueness conditions for Fréchet means is still one of the main hurdles for carrying out intrinsic analysis on manifolds . To date the most general results in the literature for generic manifolds  force the support of to be a small geodesic ball to guarantee uniqueness of the intrinsic Fréchet mean. We address this question for the space of unlabeled networks in Section 4.
(CLT.) Once conditions for uniqueness of are provided, the next key question is whether one can derive the limiting distribution for for purposes of statistical inference, e.g., proving a central limit type of theorem for , which in turn might be used for hypothesis testing.
We first illustrate the difficult nature of these problems (in particular for question 2 above) through the example in Section 2, by explicitly constructing a distribution on that has non-unique Fréchet means.
Example 3.1 (Example continued).
We can explicitly compute the Fréchet function with respect to . For , is a fundamental domain. For , . Then
where for , and .
The Fréchet mean occurs when , which is difficult to compute in general. Consider the special case
where is a fixed constant, , ; this distribution for is plotted in Figure 3. The minimum for this occurs at
with arbitrary. When is large, . For , . This shows that has a circle’s worth of Fréchet means; the -independence of implies -independence of the Fréchet means. One can see this in Figure 4 where the Fréchet function is minimal and most blue on a circle of radius approximately 13.53 (corresponding to the red circle on the cone in Figure 3).
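The angular symmetry behind this non-uniqueness can be seen numerically on a discrete stand-in for the distribution. The sketch below does not use the density of the example; instead it assumes, purely for illustration, a uniform distribution on `N` equally spaced points of the quarter circle of radius 2, and evaluates the Fréchet function for the orbit distance on the plane modulo rotations by 90 degrees. All names are ours.

```python
import math

def rotate(k, p):
    """Rotate p counterclockwise by k * 90 degrees."""
    x, y = p
    for _ in range(k % 4):
        x, y = -y, x
    return (x, y)

def quotient_sq_distance(p, q):
    """Squared distance between the orbits of p and q."""
    return min(math.dist(rotate(k, p), q) for k in range(4)) ** 2

def frechet_function(p, sample):
    """Average squared orbit distance from p to the sample points."""
    return sum(quotient_sq_distance(p, z) for z in sample) / len(sample)

N = 8
step = (math.pi / 2) / N
# Assumed toy distribution: N equally spaced angles in [0, 90) degrees, radius 2.
sample = [(2.0 * math.cos(j * step), 2.0 * math.sin(j * step)) for j in range(N)]

def at(r, theta):
    return frechet_function((r * math.cos(theta), r * math.sin(theta)), sample)

values = [at(1.5, 0.1 + j * step) for j in range(N)]
print(max(values) - min(values))  # zero up to rounding error
```

Rotating the evaluation point by one grid step permutes the orbits of the sample points, so the Fréchet function takes the same value at all `N` rotated positions. Any minimizer of this toy problem therefore comes with its rotated copies, and as the angular grid is refined toward a rotation-invariant density the minimizers fill out a full circle, as in the example.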
3.2 A strong law of large numbers
Before establishing the limiting distribution for , a natural first step is to explore the consistency properties of . Drawing on Theorem 3.3 in  for general metric spaces, we have the following result.
Let be a distribution on , let be the set of means of with respect to the Procrustean distance , and let be the set of empirical means with respect to a sample of unlabeled networks . Assume that the Fréchet function is everywhere finite. Then the following holds: (a) the Fréchet mean set is nonempty and compact; (b) for any , there exists a positive integer-valued random variable and a -null set such that
outside of ; (c) if is a singleton, i.e., the Fréchet mean is unique, then is a strongly consistent estimator of , or converges to almost surely.
We first prove that every closed and bounded subset of is compact.
Let be a fundamental domain for the action of on , as defined in Section 4, with the associated projection . This map is continuous and a diffeomorphism on the interior of Take a closed and bounded set in . Because is continuous, is closed. We now show that is also bounded. is contained in a ball centered at some with radius , so for . Now say that the largest entry in (in any ordering of the entries of ) is If the largest entry in (in any ordering) is greater than , then , a contradiction. (This holds because under any permutation of the entries of , one entry in is greater than .) Thus for any choice of , Thus is contained in the ball of radius centered at the origin, and thus is bounded.
Since is a closed subset of , the closed and bounded set is compact. Since is continuous, is compact.
Then by Theorem 3.3 in , (a) and (b) follow.
Part (c) follows from  under the uniqueness of Fréchet mean.
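Consistency can be illustrated by simulation in the three-vertex case, where the computation is especially simple. The sketch rests on two observations of ours that are not spelled out in the text: for three vertices the group acts by all rearrangements of the three weights, so by the rearrangement inequality the Procrustean distance equals the Euclidean distance between sorted weight vectors; and since sorted vectors form a convex cone, the empirical Fréchet mean is the coordinate-wise average of the sorted sample. The sampling model (three independent uniform weights) is an assumption chosen because the expected order statistics, (1/4, 1/2, 3/4), are known in closed form.

```python
import math
import random
from itertools import permutations

def procrustean_distance_d3(w1, w2):
    """Brute-force Procrustean distance for three-vertex networks."""
    return min(math.dist([w1[i] for i in s], w2) for s in permutations(range(3)))

def sorted_distance(w1, w2):
    """Euclidean distance between the sorted weight vectors."""
    return math.dist(sorted(w1), sorted(w2))

def empirical_frechet_mean_d3(sample):
    """Coordinate-wise average of the sorted weight vectors."""
    n = len(sample)
    sorted_sample = [sorted(w) for w in sample]
    return tuple(sum(w[k] for w in sorted_sample) / n for k in range(3))

rng = random.Random(0)
sample = [tuple(rng.random() for _ in range(3)) for _ in range(20000)]

for n in (100, 1000, 20000):
    print(n, empirical_frechet_mean_d3(sample[:n]))
```

As the sample size grows, the printed means settle near (0.25, 0.5, 0.75), the population Fréchet mean under this assumed model. The shortcut through sorting does not extend to four or more vertices, where sorting identifies networks that are not relabelings of one another.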
3.3 A central limit theorem
The goal of this section is to derive a central limit theorem for the empirical Fréchet mean, as an important precursor for statistical inference. One of the key challenges is to establish geometric conditions on distributions on which ensure the uniqueness of the population Fréchet mean. We discuss and address the uniqueness issue in detail in Section 4. Here, our central limit theorem assumes that the uniqueness conditions of Section 4 are met.
Let be the projection from the space of labeled networks to the space of unlabeled networks.
Assume has support on a compact set defined in Theorem 4.1, so that the pushdown measure supported on has a unique Fréchet mean . Let be the empirical Fréchet mean of an i.i.d sample with respect to the distance (3.4). Let . Then we have
where with the Hessian matrix
and is the covariance matrix of .
Here denotes the partial derivative with respect to the direction, denotes second partial derivatives, and means convergence in law or distribution.
Let be a small open neighborhood of inside , so is an open subset of . Note that is a homeomorphism. By Theorem 4.1, the projection map is a Euclidean isometry. Therefore, for any vectorized network and , one has
where Thus is twice differentiable in for any . Tracing through the definition of the smooth structure on induced from the standard structure on , we see that is twice differentiable in Also note that by the consistency of (see Theorem 3.1), one has as . One can also verify the conditions (A5) and (A6) on the Hessian matrices of Theorem 2.2 in , and our Theorem follows. ∎
As an immediate consequence of this central limit theorem, we can define natural analogues of classical hypothesis tests. For example, consider the construction of a statistical test for two or more independent samples using the same framework. Assume that we have independent sets of networks on vertices, and consider the problem of testing whether or not these sets have in fact been drawn from the same population. Formally, we have independent samples , for and . Each of these populations has an unknown mean, denoted . Then, as a direct corollary to Theorem 3.2, we have the following asymptotic result.
Assume that the distributions satisfy the conditions of Theorem 3.2. Moreover, also assume that for every sample, with , and . Then, under , we have
where denotes the empirical mean of the sample, represents the grand empirical mean of the full sample, and is a pooled estimate of covariance, with the ’s denoting the individual covariance matrix estimates of each subsample.
As previously noted, this central limit theorem and such corollaries hold only if the population Fréchet mean(s) is unique. This depends crucially on the nontrivial geometry of the space of unlabeled networks. The following section deals exclusively with this issue.
4 Geometric requirements for uniqueness of the Fréchet mean
Underlying the central limit theorem in Theorem 3.2, the basic question is: which compact subsets of have a unique Fréchet mean?
We have seen in Section 2 that may be difficult to work with, while a fundamental domain for the action of on the space of labeled networks seems more tractable. Indeed, finding the Fréchet mean for a distribution supported in is a standard center of mass calculation in Euclidean space. However, it is not clear that this Fréchet mean “upstairs” in projects to the Fréchet mean “downstairs” in the quotient space , because the metric used to compute Fréchet means downstairs is the Procrustean distance, which may or may not equal the Euclidean distance.
In this section, we find a fundamental domain by a standard procedure (Lemma 4.1), and find compact subsets for which the Fréchet mean upstairs is guaranteed to project to the Fréchet mean of , where is the projection of under the quotient/projection map . This is the content of the main result in this section (Theorem 4.1). We also show that this result in our special setup is an improvement of the best result for general Riemannian manifolds due to Afsari  (see Figure 6).
We now discuss the construction of a fundamental domain . A fundamental domain is characterized by: (i) every weight vector can be permuted by some to a network in , (ii) if has for , then (As a technical note, we always consider with respect to the induced topology on from the standard topology on ) Once has been constructed, we are guaranteed that the projection map restricts to a surjective map which is a homeomorphism from the interior of to (In fact, is a diffeomorphism on this region by definition of the smooth structure on )
It is convenient to center a choice of fundamental domain on a weight vector with trivial stabilizer.
A vector $w$ is distinct if it has trivial stabilizer for the action of $\Sigma_d$: i.e., if $\sigma \cdot w \neq w$ for all $\sigma \in \Sigma_d$ with $\sigma \neq \mathrm{id}$.
A vector with trivial stabilizer is also called a vector with trivial automorphism group. Weight vectors with all different entries are distinct, which implies that the distinct vectors are dense in
For example, consider the two graphs and in Figure 5. Both share the same connectivity pattern (i.e., are isomorphic), but have different weight vectors. The weight vector of satisfies for . In contrast, for all , where is the weight vector of , because the two 20’s belong to nodes with different valences. Thus even though have the same set of weights, is distinct, while is not.
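Whether a weight vector is distinct can be tested directly by checking every non-identity relabeling. The two weighted paths below are hypothetical stand-ins with the same flavor as the example above; they are not the graphs of Figure 5. The names `act` and `is_distinct` are ours, and vertices are indexed from 0.

```python
from itertools import combinations, permutations

def act(sigma, w, d):
    """Move the weight of edge {i, j} to the position of {sigma[i], sigma[j]}."""
    edges = list(combinations(range(d), 2))
    index = {e: k for k, e in enumerate(edges)}
    out = [0] * len(edges)
    for k, (i, j) in enumerate(edges):
        a, b = sorted((sigma[i], sigma[j]))
        out[index[(a, b)]] = w[k]
    return tuple(out)

def is_distinct(w, d):
    """True if no non-identity vertex relabeling fixes the weight vector w."""
    identity = tuple(range(d))
    w = tuple(w)
    return all(act(s, w, d) != w
               for s in permutations(range(d)) if s != identity)

# Edge order for d = 4: (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
# Both are paths 0-1-2-3 with edge weights {20, 20, 5}.
path_a = (20, 0, 0, 20, 0, 5)   # weights 20, 20, 5 along the path
path_b = (20, 0, 0, 5, 0, 20)   # weights 20, 5, 20 along the path

print(is_distinct(path_a, 4), is_distinct(path_b, 4))
```

The second path is fixed by reversing the order of its vertices, so it is not distinct, while the first has no such symmetry. Both share the same connectivity pattern and the same set of weights, which shows that distinctness depends on where the weights sit and not only on their values.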
We now explain a standard procedure to construct a fundamental domain as one region in the Voronoi diagram of the orbit of a distinct vector. (Geometers call this a Dirichlet domain.) Let be the Procrustean distance on From now on, we just write instead of for elements of
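Fix a distinct vector $w^*$. Set
\[ F = \{ w \in \mathbb{R}_{\geq 0}^D : \|w - w^*\| \leq \|\sigma \cdot w - w^*\| \text{ for all } \sigma \in \Sigma_d \}. \]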
(i) is a fundamental domain for the action of on .
(ii) is a solid cone with polyhedral cross section.
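The Voronoi construction translates into a simple, if expensive, procedure: a weight vector lies in the domain when no relabeling brings it closer to the chosen center, and any weight vector can be moved into the domain by picking the relabeling that minimizes its distance to the center. The sketch below assumes this reading of the construction; `act`, `to_fundamental_domain`, and `in_domain` are hypothetical names, the center `(1, ..., 6)` is an arbitrary distinct vector on four vertices, and the search is brute force over all relabelings.

```python
import math
from itertools import combinations, permutations

def act(sigma, w, d):
    """Move the weight of edge {i, j} to the position of {sigma[i], sigma[j]}."""
    edges = list(combinations(range(d), 2))
    index = {e: k for k, e in enumerate(edges)}
    out = [0] * len(edges)
    for k, (i, j) in enumerate(edges):
        a, b = sorted((sigma[i], sigma[j]))
        out[index[(a, b)]] = w[k]
    return tuple(out)

def to_fundamental_domain(w, w_star, d):
    """Relabeling of w that is closest to the center w_star."""
    return min((act(s, w, d) for s in permutations(range(d))),
               key=lambda v: math.dist(v, w_star))

def in_domain(w, w_star, d, tol=1e-12):
    """True if no relabeling of w is closer to w_star than w itself."""
    base = math.dist(w, w_star)
    return all(base <= math.dist(act(s, w, d), w_star) + tol
               for s in permutations(range(d)))

w_star = (1, 2, 3, 4, 5, 6)                 # assumed center, all entries different
v = (1.1, 2.2, 2.9, 4.3, 5.1, 5.8)          # a network close to the center
moved = act((1, 0, 3, 2), v, 4)             # the same network with other labels

print(in_domain(v, w_star, 4), in_domain(moved, w_star, 4))
print(to_fundamental_domain(moved, w_star, 4) == v)
```

Here the relabeled copy falls outside the domain and is mapped back to the original representative. Ties in the minimization occur only on the boundary of the domain, which is where representatives fail to be unique.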
In the proof, we use the fact that is distinct just below (4.1).
(i) First, for fixed , a minimum of is attained, since is finite. Thus every network (characterized by its weight vector ) has a permutation in .
Second, we can rewrite as
Let . Then , and