On Large-Scale Graph Generation with Validation of Diverse Triangle Statistics at Edges and Vertices

Geoffrey Sanders, Roger Pearce, Timothy La Fond (This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 and was supported by the LLNL-LDRD Program under Project No. 17-ERD-024, LLNL-CONF-748352.)
Center for Applied Scientific Computing (CASC)
Lawrence Livermore National Laboratory
Livermore, CA, USA
sanders29@llnl.gov, pearce7@llnl.gov, lafond1@llnl.gov
Jeremy Kepner
Lincoln Laboratory Supercomputing Center
MIT Lincoln Laboratory
Lexington, MA, USA
kepner@ll.mit.edu
Abstract

Researchers developing implementations of distributed graph analytic algorithms require graph generators that yield graphs sharing the challenging characteristics of real-world graphs (small-world, scale-free, heavy-tailed degree distribution) with efficiently calculable ground-truth solutions to the desired output. Reproducibility for current generators [1] used in benchmarking is somewhat lacking in this respect due to their randomness: the output of a desired graph analytic can only be compared to expected values and not exact ground truth. Nonstochastic Kronecker product graphs [2] meet these design criteria for several graph analytics. Here we show that many flavors of triangle participation can be cheaply calculated while generating a Kronecker product graph.

Given two medium-sized scale-free graphs with adjacency matrices $A$ and $B$, their Kronecker product graph has adjacency matrix $C = A \otimes B$. Such graphs are highly compressible: $|E_C|$ edges are represented in $O(|E_C|^{1/2})$ memory and can be built in a distributed setting from small data structures, making them easy to share in compressed form. Many interesting graph calculations have worst-case complexity bounds $O(|E_C|^p)$ and often these are reduced to $O(|E_C|^{p/2})$ for Kronecker product graphs, when a Kronecker formula can be derived yielding the sought calculation on $C$ in terms of related calculations on $A$ and $B$.

We focus on deriving formulas for triangle participation at vertices, $t_C$, a vector storing the number of triangles that every vertex is involved in, and triangle participation at edges, $\Delta_C$, a sparse matrix storing the number of triangles at every edge. When factors $A$ and $B$ are undirected, $C$ is also undirected. In the case when both factors have no self loops we show $t_C = 2\, t_A \otimes t_B$, $\Delta_C = \Delta_A \otimes \Delta_B$. Moreover, we derive the respective formulas when $A$ and $B$ have self loops, which boosts the triangle counts for the associated vertices/edges in $C$. We additionally demonstrate strong assumptions on $B$ that allow the truss decomposition of $C$ to be derived cheaply from the truss decomposition of $A$.

We extend these results and show Kronecker formulas for triangle participation in both directed graphs and undirected, vertex-labeled graphs. In these classes of graphs each vertex / edge can participate in many different types of triangles.

I Introduction

In recent work [3], extremely large synthetic power-law Kronecker graphs [2] are generated in an essentially communication-free implementation for the primary purpose of validating graph calculation implementations on benchmarks where the answer is known exactly. The (non-stochastic) Kronecker approach leverages ideas and observations from previous work on stochastic Kronecker graph generators [4, 5, 6], with the added benefit that the ground truth of many local and global graph statistics is efficiently calculated during the generation process. In [7], several properties like graph diameter are discussed for both the non-stochastic and stochastic cases.

Fig. 1: Under mild assumptions on $A$ and $B$ (e.g. $A$, $B$ have no self loops, $B$ is undirected) diverse triangle statistics of $C = A \otimes B$ are simply a constant times Kronecker products of the triangle statistic vectors / matrices associated with $A$ and $B$, allowing for efficient exact local triangle statistics to be calculated during graph generation.

We form a graph whose adjacency matrix is a Kronecker product [8, 2, 9] of two much smaller factors,

$$C = A \otimes B,$$

as this framework provides the ability to calculate many (normally expensive) graph statistics for $C$ cheaply from associated statistics on $A$ and $B$. A polynomial time graph calculation has the potential to be done at a square-root of the general worst-case cost and inline with the graph generation process.
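The compressibility of this construction can be sketched in a few lines. The snippet below is a minimal illustration, not the generator of [3]: it assumes each factor is stored as a list of directed arcs over 0-based vertex ids, and `kron_edges` is a hypothetical helper name introduced here.

```python
from itertools import product

def kron_edges(edges_a, edges_b, n_b):
    """Lazily yield the arcs of C = A (x) B from the arcs of A and B.

    Vertex (i, k) of the product graph is stored as the integer i * n_b + k.
    """
    for (i, j), (k, l) in product(edges_a, edges_b):
        yield (i * n_b + k, j * n_b + l)

# Path 0-1-2 and a triangle, each stored with both arc directions.
edges_a = [(0, 1), (1, 0), (1, 2), (2, 1)]
edges_b = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]

edges_c = list(kron_edges(edges_a, edges_b, n_b=3))
print(len(edges_c))  # number of arcs in C is len(edges_a) * len(edges_b)
```

Only the two small arc lists need to be stored or shared; every arc of the product is produced on demand from one arc of each factor.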

Consider triangle counting as an example. Suppose the number of edges in $A$ and $B$, $|E_A|$ and $|E_B|$, are both $O(10^6)$; then the number of edges in $C$ is $O(10^{12})$. We want to compute the number of triangles in $C$, $\tau_C$. For a general graph of this size, computing $\tau_C$ has worst-case complexity $O(|E_C|^{3/2})$ [10] or $O(10^{18})$ (and $\tau_C$ itself could be $O(10^{18})$). However, the worst-case bound for counting triangles in a $C$ of this form, using the Kronecker formula $\tau_C = 6\,\tau_A \tau_B$, is $O(|E_C|^{3/4})$ or $O(10^9)$; the number of triangles in a trillion-edge graph is computed sublinearly, in the worst case. If $A$ and $B$ are sparse, real-world, power-law graphs, the actual complexity of an implementation leveraging heuristics can be as low as , which is often significantly lower than [11, 12], even in cases where $\tau_C$ is as high as possible for a graph of its size, .
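The total-count relation can be checked on toy factors. The sketch below assumes undirected factors without self loops, for which the total number of triangles in the product is six times the product of the factor totals; `kron` and `count_triangles` are illustrative helpers written for this example, and the brute-force counter deliberately ignores the Kronecker structure.

```python
from itertools import combinations

def kron(A, B):
    """Kronecker product of two dense matrices stored as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def count_triangles(A):
    """Brute-force triangle count of an undirected graph without self loops."""
    n = len(A)
    return sum(1 for i, j, k in combinations(range(n), 3)
               if A[i][j] and A[j][k] and A[i][k])

# A: clique on 4 vertices. B: triangle 0-1-2 with a pendant vertex 3.
A = [[int(i != j) for j in range(4)] for i in range(4)]
B = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

tau_a, tau_b = count_triangles(A), count_triangles(B)
tau_c_formula = 6 * tau_a * tau_b              # work on the small factors only
tau_c_direct = count_triangles(kron(A, B))     # work on the full product graph
print(tau_c_formula, tau_c_direct)
```

The formula side touches only the factors, which is where the square-root saving in the worst-case bound comes from.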

This class of generators allows researchers to validate implementations of graph calculations that ignore the Kronecker framework on problems where the solution is known, and gain confidence in the implementations’ application to extremely large real-world graphs, where the solution is fundamentally unknown and the only hope of validation is the agreement between two or more competing implementations.

Rem. 1.

The stochastic Kronecker graphs from [4, 1] are demonstrated in [7, 13] to have relatively few triangles compared to large sparse real-world graphs. This is due to the independence of edges in the stochastic model and their extremely low combined probability for most vertex triplets. The non-stochastic Kronecker graphs we consider here are fundamentally different: they do not necessarily suffer from unreasonably low triangle counts and we give many cases throughout this paper where non-stochastic Kronecker graphs have a high number of triangles. Additionally, our formulas allow tuning of local triangle counts by adding/deleting triangles and self-loops from the input factors.

Triangle-related graph analysis is extremely important to many applications. In undirected graphs, triangle participation at vertices (or triangle degree) is a common expensive graph statistic for metrics like the local clustering coefficient of a vertex [14]. Similarly, triangle participation at edges is used for clustering coefficient of an edge. Both types of participation are additionally used in several less-local graph analytics like improved clustering [15], truss decompositions [16, 17, 18, 19, 20], and realistic graph generation. Furthermore, in directed graphs or labeled graphs, diverse triangle statistics that count various types of triangles are calculable at every vertex and edge (see Figs 4, 5, and 6) and these statistics are useful for several interesting applications like motif-based clustering [21] or pattern detection [22]. All forms of triangle participation are likely attractive topological features in supervised/unsupervised machine learning applications [23, 24].

There is currently significant research effort towards developing algorithms and systems capable of computing large-scale triangle counting, participation, and enumeration. These efforts include implementations in MapReduce [25, 26], leveraging GPUs [27, 18, 20], in shared memory [28], utilizing linear algebraic kernels [29, 30, 31, 32], and several other graph HPC implementations [17, 12, 33, 34]. A recent workshop, IEEE HPEC 2017 Graph Challenge [18], was organized to accelerate the progress of these efforts via cross-collaboration.

In this paper, we derive several new Kronecker formulas for diverse triangle participation counts of all vertices and edges in several classes of graphs. Our contributions include:

  • formulas for triangle participation of all edges and vertices in an undirected Kronecker product graph, in cases where both factors have no self loops, where only one factor has self loops, or where both factors have self loops.

  • formulas for participation of all edges and vertices in the many types of directed triangles in a directed Kronecker product graph, in cases where one factor is directed (nonsymmetric) without self loops and the other factor is undirected (symmetric) and possibly contains self loops.

  • formulas for participation of all edges and vertices in the many types of vertex-labeled triangles in an undirected vertex-labeled Kronecker product graph, in cases where one factor is vertex-labeled without self loops and the other factor is unlabeled and possibly contains self loops.

  • Several implications of these formulas regarding properties of degree and triangle distributions for Kronecker product graphs.

  • A strategy for generating graphs with known truss decomposition.

  • Several simple examples for validating and checking these formulas.

II Preliminaries

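Matrices formed by Kronecker products are block structured and we define some convenience functions to write the index maps compactly. For a block-structured array with block-size $m$, we define functions that, for a given global index $p$, retrieve the block number, $\alpha_m(p) = \lfloor (p-1)/m \rfloor + 1$, and the intra-block index, $\beta_m(p) = ((p-1) \bmod m) + 1$.

The inverse of $(\alpha_m, \beta_m)$ is

$$\gamma_m(i, k) = (i-1)\,m + k,$$

in the sense that

$$\gamma_m(\alpha_m(p), \beta_m(p)) = p.$$

Def. 1.

(Kronecker Product [8, 2, 9]) Let $A \in \mathbb{R}^{m_A \times n_A}$ and $B \in \mathbb{R}^{m_B \times n_B}$. The Kronecker Product of $A$ and $B$ is $C = A \otimes B \in \mathbb{R}^{m_A m_B \times n_A n_B}$ and has entries

$$C_{\gamma_{m_B}(i,k),\, \gamma_{n_B}(j,l)} = A_{ij} B_{kl}$$

for $1 \le i \le m_A$, $1 \le j \le n_A$ and $1 \le k \le m_B$, $1 \le l \le n_B$, or, equivalently,

$$C_{pq} = A_{\alpha_{m_B}(p),\, \alpha_{n_B}(q)}\, B_{\beta_{m_B}(p),\, \beta_{n_B}(q)}$$

for $1 \le p \le m_A m_B$ and $1 \le q \le n_A n_B$.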
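The block-number and intra-block index maps are easy to exercise in code. The following sketch uses 0-based indices (the text is 1-based), and the names `block`, `intra`, and `merge` are hypothetical stand-ins for the three maps described above.

```python
def kron(A, B):
    """Kronecker product of two dense matrices stored as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def block(p, m):
    """Block number of 0-based global index p for block size m."""
    return p // m

def intra(p, m):
    """Intra-block index of 0-based global index p for block size m."""
    return p % m

def merge(i, k, m):
    """Inverse map: global index from block number i and intra-block index k."""
    return i * m + k

A = [[1, 2], [3, 4]]              # 2 x 2
B = [[0, 5, 6], [7, 8, 9]]        # 2 x 3
C = kron(A, B)                    # 4 x 6
m_b, n_b = len(B), len(B[0])      # row and column block sizes

# Entry-wise definition: C[p][q] = A[block(p), block(q)] * B[intra(p), intra(q)].
entries_match = all(
    C[p][q] == A[block(p, m_b)][block(q, n_b)] * B[intra(p, m_b)][intra(q, n_b)]
    for p in range(len(C)) for q in range(len(C[0]))
)
round_trip = all(merge(block(p, m_b), intra(p, m_b), m_b) == p for p in range(len(C)))
print(entries_match, round_trip)
```

Rows use the row count of the second factor as block size and columns use its column count, which matters only when the second factor is not square.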

Prop. 1.

(Properties of the Kronecker Product [8, 2, 9])

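  • Scalar Multiplication. For any $c \in \mathbb{R}$, $c\,(A \otimes B) = (cA) \otimes B = A \otimes (cB)$.

  • Distributivity. $(A_1 + A_2) \otimes B = A_1 \otimes B + A_2 \otimes B$ and $A \otimes (B_1 + B_2) = A \otimes B_1 + A \otimes B_2$.

  • Transposition. $(A \otimes B)^T = A^T \otimes B^T$.

  • Matrix-Matrix Multiplication. When the products $A_1 A_2$ and $B_1 B_2$ are defined, $(A_1 \otimes B_1)(A_2 \otimes B_2) = (A_1 A_2) \otimes (B_1 B_2)$.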

Def. 2.

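(Hadamard Product [9]) Let $A, B \in \mathbb{R}^{m \times n}$. The Hadamard Product of $A$ and $B$ is $A \circ B \in \mathbb{R}^{m \times n}$, with

$$(A \circ B)_{ij} = A_{ij} B_{ij}$$

for $1 \le i \le m$ and $1 \le j \le n$.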

Def. 3.

(Standard Matrix and Vector Objects) Given $n$, $O_n$ is the matrix of all zeros and $I_n$ is the identity matrix, both of size $n \times n$. Constant vectors $\mathbf{0}_n$, $\mathbf{1}_n$ are the vector of all zeros, and the vector of all ones, both of dimension $n$. Subscripts are dropped when the size is clear from context.

We define some diagonal operators of square matrices in terms of the Hadamard product and recall several useful formulas regarding Hadamard products, as they simplify many derivations in the rest of the paper.

Def. 4.

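(Matrix Diagonal Operators) Given $A \in \mathbb{R}^{n \times n}$, the matrix $I \circ A$ is the diagonal entries of $A$. The diagonal operator is $\mathrm{diag}(A) = (I \circ A)\mathbf{1}$, a vector in $\mathbb{R}^n$.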

Prop. 2.

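(Properties of the Hadamard Product [9]) In the following, we implicitly assume that and whenever is present.

  • (a) Commutativity. $A \circ B = B \circ A$.

  • (b) Scalar Multiplication. For any $c \in \mathbb{R}$, $c\,(A \circ B) = (cA) \circ B = A \circ (cB)$.

  • (c) Distributivity. $(A_1 + A_2) \circ B = A_1 \circ B + A_2 \circ B$.

  • (d) Transposition. $(A \circ B)^T = A^T \circ B^T$.

  • (e) Hadamard-Kronecker Distributivity. $(A_1 \otimes B_1) \circ (A_2 \otimes B_2) = (A_1 \circ A_2) \otimes (B_1 \circ B_2)$.

  • (f) Diagonal-Kronecker Distributivity. When $A \in \mathbb{R}^{n_A \times n_A}$ and $B \in \mathbb{R}^{n_B \times n_B}$, $I \circ (A \otimes B) = (I \circ A) \otimes (I \circ B)$ and $\mathrm{diag}(A \otimes B) = \mathrm{diag}(A) \otimes \mathrm{diag}(B)$.

Proof.

(a)-(e) are standard properties. For (f), the identity factors as $I_{n_A n_B} = I_{n_A} \otimes I_{n_B}$, so by (e), $I_{n_A n_B} \circ (A \otimes B) = (I_{n_A} \otimes I_{n_B}) \circ (A \otimes B) = (I_{n_A} \circ A) \otimes (I_{n_B} \circ B)$. Multiplying by $\mathbf{1}_{n_A n_B} = \mathbf{1}_{n_A} \otimes \mathbf{1}_{n_B}$ gives the vector form. ∎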
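The three identities used most often in later derivations (the mixed matrix product, Hadamard-Kronecker distributivity, and diagonal-Kronecker distributivity) can be spot-checked numerically. This is a small sanity check on arbitrary integer matrices, not a proof; all helper names are illustrative.

```python
def kron(A, B):
    """Kronecker product of two dense matrices stored as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def matmul(X, Y):
    """Dense matrix product."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def hadamard(X, Y):
    """Entry-wise product of two equally sized matrices."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def diag(X):
    """Vector of diagonal entries of a square matrix."""
    return [X[i][i] for i in range(len(X))]

A1, A2 = [[1, 2], [3, 4]], [[2, 0], [1, 3]]
B1, B2 = [[0, 1], [1, 1]], [[1, 1], [0, 2]]

# (A1 (x) B1)(A2 (x) B2) = (A1 A2) (x) (B1 B2)
mixed_product_ok = (matmul(kron(A1, B1), kron(A2, B2))
                    == kron(matmul(A1, A2), matmul(B1, B2)))
# (A1 (x) B1) o (A2 (x) B2) = (A1 o A2) (x) (B1 o B2)
hadamard_kron_ok = (hadamard(kron(A1, B1), kron(A2, B2))
                    == kron(hadamard(A1, A2), hadamard(B1, B2)))
# diag(A1 (x) B1) = diag(A1) (x) diag(B1)
diag_kron_ok = diag(kron(A1, B1)) == [a * b for a in diag(A1) for b in diag(B1)]
print(mixed_product_ok, hadamard_kron_ok, diag_kron_ok)
```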

II-A Graph Notation

Let $G = (V, E)$ be a set of vertices $V$ and edges $E$, pair-wise relationships between members of $V$ of the form $(i, j)$, where $i, j \in V$. We say $G$ is undirected if $(i, j) \in E$ implies $(j, i) \in E$ for every $(i, j) \in E$ (and $G$ is directed if this fails for at least one edge). An edge of the form $(i, i)$ is a self loop.

Let $n = |V|$. The matrix $A \in \{0,1\}^{n \times n}$ is an adjacency matrix representing $G$ if $A_{ij} = 1$ for each $(i, j) \in E$ and $A_{ij} = 0$ for each $(i, j) \notin E$. Given an adjacency matrix $A$, we use $G_A$, $V_A$, and $E_A$, to represent the associated graph, vertices, and edges, respectively. We use a subscript $A$ for many other symbols referring to properties of $A$ (e.g. $n_A$).

Rem. 2.

Edge incidence matrices and matrix reorderings are important constructs for the most efficient linear algebraic formulas for performing some actual graph computations [29, 30, 31, 32]. However, edge incidence matrices would greatly complicate the Kronecker formula derivations we present, due to their row ordering being arbitrary. Therefore, we avoid using them in the derivations presented in this work.

III Undirected Graphs

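Let $A \in \{0,1\}^{n_A \times n_A}$, $B \in \{0,1\}^{n_B \times n_B}$, be two adjacency matrices (possibly with self loops) on $n_A$ and $n_B$ vertices, respectively. The matrix $C = A \otimes B$ is an adjacency matrix (possibly with self loops) on $n_C = n_A n_B$ vertices. We define index maps

$$i = \lfloor (p-1)/n_B \rfloor + 1, \quad k = ((p-1) \bmod n_B) + 1, \quad j = \lfloor (q-1)/n_B \rfloor + 1, \quad l = ((q-1) \bmod n_B) + 1,$$

so $C_{pq} = A_{ij} B_{kl}$ or $C_{(i-1) n_B + k,\, (j-1) n_B + l} = A_{ij} B_{kl}$. For diagonal elements, $C_{pp} = A_{ii} B_{kk}$.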

Rem. 3.

(Self-Loops) As observed in [7],[3], putting self loops into the factors $A$ and $B$ boosts the number of triangles in $C$ significantly. Therefore we will analyze the case when factors have no self-loops (for simplicity) and cases when one or more of the factors have self loops. Also, removing all self loops from an adjacency matrix can be written in terms of the Hadamard product, $A - I \circ A$, making Kronecker product formulas still fairly simple for many types of graph statistics. The diagonal operator containing only the self-loops, $I \circ A$, is used in many of the following derivations.

Throughout this section we provide several formulas for computing exact graph statistics for that involve Kronecker products of associated graph statistics of and . The following example provides the reader with sanity checks of the formulas throughout the section.

Ex. 1.

(Cliques With and Without Self-Loops) Let $n \ge 3$ and define the adjacency matrix of a clique of size $n$ as $K_n = \mathbf{1}\mathbf{1}^T - I$. Within the graph associated with $K_n$, the degree of each vertex is $n-1$, the number of triangles involving each vertex is $(n-1)(n-2)/2$, and the number of triangles involving each edge is $n-2$.

Note that $J_n = K_n + I = \mathbf{1}\mathbf{1}^T$ is the adjacency matrix of a clique of size $n$ where every vertex has a self loop. We form three simple examples of Kronecker products involving $K_n$ and $J_n$.

  • Ex. 1(a), no self loops. Let $C = K_{n_A} \otimes K_{n_B}$. Then the degree of each vertex is $(n_A-1)(n_B-1)$. The number of triangles involving each vertex is

    $$\tfrac{1}{2}(n_A-1)(n_A-2)(n_B-1)(n_B-2).$$

    The number of triangles involving each edge is

    $$(n_A-2)(n_B-2).$$

  • Ex. 1(b), self loops in second factor. Let $C = K_{n_A} \otimes J_{n_B}$. Then the degree of each vertex is $(n_A-1)\,n_B$. The number of triangles involving each vertex is

    $$\tfrac{1}{2}(n_A-1)(n_A-2)\,n_B^2.$$

    The number of triangles involving each edge is

    $$(n_A-2)\,n_B.$$

  • Ex. 1(c), self loops in both factors. Let , which is equal to . Then the degree of each vertex is . The number of triangles involving each vertex is and the number of triangles involving each edge is .

Note that Ex. 1(c) demonstrates that $C$ can have as many triangles as possible, as it is a clique on $n_A n_B$ vertices.
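The clique examples are small enough to check by direct computation. The sketch below builds the three products for factor sizes 4 and 3 and measures degree and triangle participation on the product itself, with self loops stripped before counting (so degrees here exclude self loops, which is an assumption of this example). The helper names `clique` and `stats` are illustrative.

```python
def kron(A, B):
    """Kronecker product of two dense matrices stored as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def clique(n, loops=False):
    """Adjacency matrix of a clique on n vertices, optionally with self loops."""
    return [[1 if (i != j or loops) else 0 for j in range(n)] for i in range(n)]

def stats(A):
    """Sets of distinct degrees, triangles per vertex, and triangles per edge."""
    n = len(A)
    S = [[0 if i == j else A[i][j] for j in range(n)] for i in range(n)]  # drop loops
    S2 = matmul(S, S)
    S3 = matmul(S2, S)
    degrees = {sum(row) for row in S}
    per_vertex = {S3[i][i] // 2 for i in range(n)}
    per_edge = {S2[i][j] for i in range(n) for j in range(n) if S[i][j]}
    return degrees, per_vertex, per_edge

n_a, n_b = 4, 3
ex_a = stats(kron(clique(n_a), clique(n_b)))                          # no self loops
ex_b = stats(kron(clique(n_a), clique(n_b, loops=True)))              # loops in B
ex_c = stats(kron(clique(n_a, loops=True), clique(n_b, loops=True)))  # loops in both
print(ex_a, ex_b, ex_c)
```

Each statistic is uniform over the graph in these examples, so every returned set has a single element.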

III-A Degree-Distribution

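As shown in [7],[3], it is simple to see the degree distribution vector of $C$, $d_C$, in terms of the degree distribution vectors of $A$ and $B$. Without self loops in $A$ and $B$, $d_A = A\mathbf{1}$, $d_B = B\mathbf{1}$, and

$$d_C = C\mathbf{1} = (A \otimes B)(\mathbf{1} \otimes \mathbf{1}) = d_A \otimes d_B.$$

Note that $d_C$ is definitely not a perfect power law distribution (for one, no prime degree greater than $\max\{\|d_A\|_\infty, \|d_B\|_\infty\}$ is possible). However, if $A$ and $B$ have power-law degree distributions (such as a Pareto distribution) we can estimate the tail behavior of $C$'s degree distribution to be heavy-tailed, as it is a multinomial of heavy-tailed distributions [7]. However, it is important to note that the ratio of maximum degree to number of nodes is essentially squared,

$$\frac{\|d_C\|_\infty}{n_C} = \frac{\|d_A\|_\infty}{n_A} \cdot \frac{\|d_B\|_\infty}{n_B}.$$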
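A short check of the degree relation for factors without self loops: the degree vector of the product is the Kronecker product of the factor degree vectors, so the maximum-degree-to-vertex-count ratio multiplies. The star and path factors below are arbitrary illustrative choices, and exact fractions are used to avoid floating-point comparison.

```python
from fractions import Fraction

def kron(A, B):
    """Kronecker product of two dense matrices stored as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def degrees(A):
    """Row sums of an adjacency matrix (degrees when there are no self loops)."""
    return [sum(row) for row in A]

star = [[0, 1, 1, 1],      # hub 0 joined to three leaves
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
path = [[0, 1, 0],         # path 0-1-2
        [1, 0, 1],
        [0, 1, 0]]

d_a, d_b = degrees(star), degrees(path)
d_c = degrees(kron(star, path))
d_kron = [a * b for a in d_a for b in d_b]      # d_A (x) d_B

ratio_c = Fraction(max(d_c), len(d_c))
ratio_product = Fraction(max(d_a), len(d_a)) * Fraction(max(d_b), len(d_b))
print(d_c == d_kron, ratio_c, ratio_product)
```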

With self-loops in only, , and with self loops in both factors,

The squaring of the ratio of maximum degree to the number of nodes is qualitatively different (unless or ).

III-B Triangle Participation of Vertices

Fig. 2: Triangle participation at vertices and edges in a graph with no self-loops. Left: $(A^3)_{ii}$ counts the number of ways $i$ can take 3 hops to get back to itself, which double counts each triangle, clockwise (as pictured) and counterclockwise, so $\tfrac{1}{2}\mathrm{diag}(A^3)$ counts triangles at every vertex. Right: $(A^2)_{ij}$ counts the number of 2-paths between vertices $i$ and $j$, so $A \circ A^2$ counts triangles at every edge.
Def. 5.

(Triangle Participation at Vertices) For adjacency matrix $A$, triangle participation at vertices is represented by $t_A$, a vector that counts the number of undirected triangles at each vertex. For undirected $A$ with self loops, we have

$$t_A = \tfrac{1}{2}\,\mathrm{diag}\big((A - I \circ A)^3\big).$$

Note that when $A$ has no self loops, then $t_A = \tfrac{1}{2}\,\mathrm{diag}(A^3)$.

Thm. 1.

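(Triangle Participation at Vertices) Let $C = A \otimes B$. Assume the factors of $C$ are undirected, $A = A^T$, $B = B^T$, and have no self-loops, $\mathrm{diag}(A) = \mathbf{0}$, $\mathrm{diag}(B) = \mathbf{0}$. Then the triangle participation of each vertex is given by

$$t_C = 2\, t_A \otimes t_B.$$

Proof.

Using the diag() operator, we have

$$t_C = \tfrac{1}{2}\,\mathrm{diag}(C^3) = \tfrac{1}{2}\,\mathrm{diag}(A^3 \otimes B^3) = \tfrac{1}{2}\,\mathrm{diag}(A^3) \otimes \mathrm{diag}(B^3) = 2\, t_A \otimes t_B. \qquad ∎$$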

One can validate this formula for Ex. 1(a). Also, from $t_C = 2\, t_A \otimes t_B$ it is easy to see that the total number of triangles in $C$, $\tau_C$, obeys $\tau_C = 6\, \tau_A \tau_B$, where $\tau_A$ and $\tau_B$ are the total numbers of triangles in $A$ and $B$. Notice that without self-loops in $A$ or $B$ there will always be an even number of triangles for every vertex in $C$. More generally, if we allow $B$ to have self loops and $A$ to have none, then $C$ has no self loops. The formula for triangle participation is still very simple.
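The vertex formula for factors without self loops can be exercised on two small graphs. In this sketch the factors are a 4-cycle with a hub and a triangle with a pendant vertex, chosen only for illustration; the vertex counts on the product are computed from the product's own adjacency matrix and compared with twice the Kronecker product of the factor counts.

```python
def kron(A, B):
    """Kronecker product of two dense matrices stored as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def triangles_at_vertices(A):
    """Half the diagonal of A^3; valid for undirected A without self loops."""
    A3 = matmul(matmul(A, A), A)
    return [A3[i][i] // 2 for i in range(len(A))]

wheel = [[0, 1, 1, 1, 1],   # hub 0 plus cycle 1-2-3-4
         [1, 0, 1, 0, 1],
         [1, 1, 0, 1, 0],
         [1, 0, 1, 0, 1],
         [1, 1, 0, 1, 0]]
kite = [[0, 1, 1, 0],       # triangle 0-1-2 plus pendant vertex 3
        [1, 0, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0]]

t_a = triangles_at_vertices(wheel)
t_b = triangles_at_vertices(kite)
t_c = triangles_at_vertices(kron(wheel, kite))     # computed on the product
t_formula = [2 * a * b for a in t_a for b in t_b]  # computed on the factors

tau_c = sum(t_c) // 3        # every triangle is counted at its three vertices
print(t_c == t_formula, tau_c)
```

Every entry of the product vector is even, as noted above, and vertices paired with the pendant vertex of the second factor sit in no triangle.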

Cor. 1.

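(Self Loops) Assume the factors of $C$ are undirected graphs, $A = A^T$, $B = B^T$, $A$ has no self-loops, $\mathrm{diag}(A) = \mathbf{0}$, but $B$ has self loops, $\mathrm{diag}(B) \ne \mathbf{0}$. Then the triangle participation of each vertex is given by

$$t_C = t_A \otimes \mathrm{diag}(B^3).$$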

Proof in Appendix.
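When only the second factor carries self loops, the product still has none, so half the diagonal of its cube factors into the triangle vector of the first factor and the diagonal of the cube of the second. The sketch below checks that relation against a brute-force count on a tiny example; the factor choices and helper names are assumptions made for illustration.

```python
from itertools import combinations

def kron(A, B):
    """Kronecker product of two dense matrices stored as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def triangles_at_vertices(A):
    """Brute-force triangle participation; self loops never form a triangle here."""
    t = [0] * len(A)
    for i, j, k in combinations(range(len(A)), 3):
        if A[i][j] and A[j][k] and A[i][k]:
            t[i] += 1
            t[j] += 1
            t[k] += 1
    return t

A = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]   # triangle, no self loops
B = [[1, 1], [1, 0]]                    # single edge with a self loop on vertex 0

B3 = matmul(matmul(B, B), B)
diag_b3 = [B3[k][k] for k in range(len(B))]
t_a = triangles_at_vertices(A)

t_formula = [a * b for a in t_a for b in diag_b3]   # t_A (x) diag(B^3)
t_direct = triangles_at_vertices(kron(A, B))
print(t_formula, t_direct)
```

The self loop in the second factor triples the participation of the product vertices paired with it, while vertices paired with the loop-free vertex keep a single triangle.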

Using , one can validate this formula for Ex. 1(b). Note that in the corollary above contains in its -th entry double counts of triangles and the four other three-step sequences from vertex to itself that involve non-trivial edges and self loops. For example, if is connected to , counts the non-triangles and

In the fully general case, where and both have self loops, we have a more complicated formula. Let . Note that for any we have diag, and (a diagonal with 0 and 1 entries) to show

One can validate this formula for Ex. 1(c), using , , .

III-C Triangle Participation of Edges

Def. 6.

(Triangle Participation at Edges)

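Triangle participation at edges, $\Delta_A$, is a matrix

$$\Delta_A = (A - I \circ A) \circ (A - I \circ A)^2,$$

whose $(i,j)$-th entry is the number of triangles in which edge $(i,j)$ participates. When $A$ has no self-loops, $\Delta_A = A \circ A^2$.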

A useful formula from this definition is .

Thm. 2.

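(Triangle Participation at Edges) Let $C = A \otimes B$. Assume the factors of $C$ are undirected, $A = A^T$, $B = B^T$, and have no self-loops, $\mathrm{diag}(A) = \mathbf{0}$, $\mathrm{diag}(B) = \mathbf{0}$. The number of triangles in any edge within the graph of $C$ is

$$\Delta_C = \Delta_A \otimes \Delta_B.$$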

Proof in Appendix.
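The edge-level relation can be checked the same way as the vertex-level one. This sketch assumes factors without self loops, computes the Hadamard product of each adjacency matrix with its square, and compares the result on the product graph with the Kronecker product of the factor results. Factor choices and helper names are illustrative.

```python
def kron(A, B):
    """Kronecker product of two dense matrices stored as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def triangles_at_edges(A):
    """A o A^2: entry (i, j) counts common neighbours of i and j if (i, j) is an edge."""
    A2 = matmul(A, A)
    return [[a * p for a, p in zip(ra, rp)] for ra, rp in zip(A, A2)]

wheel = [[0, 1, 1, 1, 1],   # hub 0 plus cycle 1-2-3-4
         [1, 0, 1, 0, 1],
         [1, 1, 0, 1, 0],
         [1, 0, 1, 0, 1],
         [1, 1, 0, 1, 0]]
kite = [[0, 1, 1, 0],       # triangle 0-1-2 plus pendant vertex 3
        [1, 0, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0]]

delta_a = triangles_at_edges(wheel)
delta_b = triangles_at_edges(kite)
delta_c = triangles_at_edges(kron(wheel, kite))   # computed on the product
delta_formula = kron(delta_a, delta_b)            # computed on the factors
print(delta_c == delta_formula)
```

Product edges built from the pendant edge of the second factor get a zero count, so they appear in the product graph but belong to no triangle.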

Cor. 2.

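(Self Loops) Assume the factors of $C$ are undirected, $A = A^T$, $B = B^T$, and $A$ has no self-loops, $\mathrm{diag}(A) = \mathbf{0}$. The number of triangles in any edge within the graph of $C$ is

$$\Delta_C = \Delta_A \otimes (B \circ B^2).$$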

The proof is straightforward because , due to having no self loops. If and both have self loops, then employ , , , and to see that

Again, all three results in this section can be validated by applying the formulas to Ex. 1(a)-(c).

III-D Truss Decomposition

We discuss deriving simple Kronecker formulas for the $k$-truss and the truss decomposition [16, 11] of $C$ and give a simple example showing this is difficult in general.

Def. 7.

($k$-Truss and Truss Decomposition [16, 11]) A $k$-truss in $G_A$ is a non-trivial, one-component subgraph of $G_A$ such that each edge in the $k$-truss participates in at least $k-2$ triangles whose edges are all in the $k$-truss.

The truss decomposition of $G_A$ is the sequence of edge sets

$$T_A^{(k)} = \{\, e \in E_A : e \text{ is in a } k\text{-truss of } G_A \,\}$$

for $k = 3, 4, \ldots$.

A simple (yet inefficient) algorithm gives the truss decomposition of $G_A$. Set $\hat{A} = A$. Repeat the following for $k = 3, 4, \ldots$, or until there are no more edges. Compute $\Delta_{\hat{A}}$. Remove any edge that has fewer than $k-2$ triangles and update $\hat{A}$. Repeat these edge removal phases for fixed $k$, recomputing $\Delta_{\hat{A}}$, removing, and updating $\hat{A}$ until no edges are removed. Then, set $T_A^{(k)}$ equal to all remaining edges in $\hat{A}$, increment $k$, and repeat edge removal phases until done.
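The peeling procedure described above can be sketched with adjacency sets. This is a minimal illustration under stated assumptions: vertices are comparable integers, an edge needs at least k - 2 triangles to survive phase k, and the one-component requirement of the definition is ignored (the function returns all surviving edges for each k). The name `truss_decomposition` is introduced here for the example.

```python
def truss_decomposition(edges):
    """Return {k: edges surviving phase k} for k = 3, 4, ... via simple peeling."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    trusses, k = {}, 3
    while True:
        while True:  # removal phases for a fixed k
            # Triangles at edge (u, v) = number of common neighbours of u and v.
            weak = [(u, v) for u in adj for v in adj[u]
                    if u < v and len(adj[u] & adj[v]) < k - 2]
            if not weak:
                break
            for u, v in weak:
                adj[u].discard(v)
                adj[v].discard(u)
        remaining = {(u, v) for u in adj for v in adj[u] if u < v}
        if not remaining:
            return trusses
        trusses[k] = remaining
        k += 1

# Clique on 4 vertices: every edge is in 2 triangles.
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
# Hub 0 plus 4-cycle 1-2-3-4: cycle edges are in 1 triangle, hub edges in 2.
wheel_edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4), (1, 4)]

k4_trusses = truss_decomposition(k4_edges)
wheel_trusses = truss_decomposition(wheel_edges)
print({k: len(e) for k, e in k4_trusses.items()})
print({k: len(e) for k, e in wheel_trusses.items()})
```

In the second graph, removing the cycle edges at k = 4 strips the hub edges of all their triangles, which is the cascading effect that makes a direct Kronecker formula hard.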

For $C = A \otimes B$, the formula $\Delta_C = \Delta_A \otimes \Delta_B$ (Thm. 2) seems useful for mapping the truss decompositions of $A$ and $B$ onto the truss decomposition of $C$. The following example shows that a simple Kronecker formula is insufficient.

Fig. 3: Graphs from Ex. 2. Left: graph associated with $A$ with hub edges colored blue and cycle edges colored red. Right: vertices from $C = A \otimes A$ with a subset of the edges drawn and labeled as hub-hub (blue), hub-cycle (green), cycle-hub (orange), and cycle-cycle (red).
Ex. 2.

Let $A$ be a 4-cycle with an added hub,

$$A = \begin{bmatrix} 0&1&1&1&1\\ 1&0&1&0&1\\ 1&1&0&1&0\\ 1&0&1&0&1\\ 1&1&0&1&0 \end{bmatrix},$$

a graph with 5 vertices, 8 edges, and 4 triangles (see Fig. 3). The cycle edges of $A$ (those not involving vertex 1) all participate in a single triangle, whereas the hub edges all participate in 2 triangles. All edges from the graph of $A$ are in the 3-truss, yet no edges are in the 4-truss.

Let $C = A \otimes A$, which has an associated graph with 25 vertices, 128 edges, and 96 triangles. Due to the Kronecker decomposition of $C$, edges can be described as combinations of hub and cycle edges (see the right side of Fig. 3). Using the Kronecker formula (Thm. 2) for triangle participation of edges, we see there are 32 edges that participate in 1 triangle (cycle-cycle edges), 64 that participate in 2 triangles (hub-cycle and cycle-hub edges), and 32 that participate in 4 triangles (hub-hub edges). The graph of $C$ has 128 edges in the 3-truss, 80 edges in the 4-truss, and zero in the 5-truss, a more complicated structure than that of a simple Kronecker product.
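The counts quoted in this example can be reproduced by building the product graph explicitly. The sketch below uses 0-based vertex ids (so the hub is vertex 0), assumes the cycle visits the non-hub vertices in index order, and reuses the simple peeling idea with a threshold of k - 2 triangles per edge; `adjacency` and `truss_sizes` are hypothetical helper names.

```python
from collections import Counter
from itertools import product

def adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def truss_sizes(edges):
    """Number of edges surviving each k-phase of the simple peeling algorithm."""
    adj, sizes, k = adjacency(edges), {}, 3
    while True:
        while True:
            weak = [(u, v) for u in adj for v in adj[u]
                    if u < v and len(adj[u] & adj[v]) < k - 2]
            if not weak:
                break
            for u, v in weak:
                adj[u].discard(v)
                adj[v].discard(u)
        remaining = sum(len(nb) for nb in adj.values()) // 2
        if remaining == 0:
            return sizes
        sizes[k] = remaining
        k += 1

n = 5
undirected = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4), (4, 1)]
arcs = undirected + [(v, u) for u, v in undirected]

# Undirected edge set of C = A (x) A; vertex (i, k) is stored as i * n + k.
edges_c = {tuple(sorted((i * n + k, j * n + l)))
           for (i, j), (k, l) in product(arcs, arcs)}

adj_c = adjacency(edges_c)
tri_per_edge = {e: len(adj_c[e[0]] & adj_c[e[1]]) for e in edges_c}
histogram = Counter(tri_per_edge.values())       # triangles per edge -> edge count
total_triangles = sum(tri_per_edge.values()) // 3
sizes = truss_sizes(edges_c)
print(len(adj_c), len(edges_c), total_triangles, dict(histogram), sizes)
```

The 4-truss loses the cycle-cycle edges first and then the hub-hub edges that touch the hub-hub vertex, which is why 80 rather than 96 edges remain.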

For a Kronecker formula for the truss decomposition of $C$, either sophisticated assumptions on $A$ and $B$ need to be made or diagnostics of the intermediate phases of the truss decomposition algorithm need to be involved in the formula. In this work, we make fairly strong assumptions on one factor (edges of $B$ participate in at most one triangle) that imply a simple formula for the truss decomposition of $C$. Note that requiring edges of $B$ to participate in at most one triangle is a stronger assumption than the 3-truss being the only nontrivial set of the truss decomposition of $B$.

Thm. 3.

(Truss Decomposition) Let . Assume the factors of are undirected, , have no self-loops, , and edges of participate in no more than one triangle, or . We have, if and only if

for

Proof.

If is zero, then is in no triangle and is in no -truss. Let be the set of edges for which . Due to the strong assumptions on , the simple truss decomposition algorithm applied to the remaining edges in proceeds in lock step with that of in the sense that any edge is removed at phase if and only if is also removed at phase , as . ∎

Note that most real-world graphs do not satisfy the assumptions on $B$ made in the previous result. Until more sophisticated theory weakens these assumptions, we have two possible strategies for generating factors $B$ that are scale-free.

  • Delete edges in a real-world graph until all edges participate in at most one triangle, while maintaining connectivity (with any spanning tree).

  • Use a simple graph generator (based on preferential attachment [35]) to yield power-law graphs where each edge participates in no more than one triangle. The generator starts with a single edge and proceeds as follows. For each new node $v$, pick edge $e = (u, w)$ uniformly at random from the previously existing edges. Pick one endpoint of $e$, say $u$, uniformly at random and add $(u, v)$ to the list of edges. If the number of triangles that $e$ participates in is zero, then let $w$ be the vertex in $e$ that wasn't already attached, add $(w, v)$ to the list of edges, and increment the triangle count for $e$ and for the two new edges. Repeat for a new $v$ until the desired number of vertices is met.
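The second strategy can be sketched directly from its description. The function below is one reading of that procedure, not the authors' implementation: `one_triangle_graph`, its `seed` parameter, and the edge-to-count dictionary it returns are assumptions made for this example.

```python
import random

def one_triangle_graph(n_vertices, seed=0):
    """Grow a graph in which every edge lies in at most one triangle.

    Returns a dict mapping each edge (u, v) with u < v to its triangle count.
    """
    rng = random.Random(seed)
    edges = [(0, 1)]                    # start from a single edge
    tri = {(0, 1): 0}
    for v in range(2, n_vertices):
        e = rng.choice(edges)           # uniform over existing edges
        u = rng.choice(e)               # uniform endpoint, so high degree is favoured
        w = e[0] if u == e[1] else e[1]
        close = tri[e] == 0             # close a triangle only if e is in none yet
        new_edges = [(u, v), (w, v)] if close else [(u, v)]
        if close:
            tri[e] = 1
        for f in new_edges:             # v is the newest id, so each tuple is sorted
            edges.append(f)
            tri[f] = 1 if close else 0
    return tri

graph = one_triangle_graph(50, seed=0)
print(len(graph), max(graph.values()))
```

Choosing a uniform random edge and then one of its endpoints selects vertices with probability proportional to degree, which is the preferential-attachment ingredient. A new vertex touches at most the two endpoints of the chosen edge, so it can never create a second triangle on any existing edge.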

IV Directed Graphs

IV-A Directed and Reciprocal Parts

We use the most general directed graph model, where directed edges and reciprocal edges are treated differently [36]. Under this model topological diversity is more locally detectable than under the standard model. For example, a vertex that has 3 reciprocal edges is very different than a vertex that has 6 disjoint edges (3 incoming and 3 outgoing).

Def. 8.

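(Reciprocal and Directed Edges) Given a directed graph $G = (V, E)$, we divide the edges in two disjoint sets, $E = E_d \cup E_r$, where $E_d$ contains all directed edges and $E_r$ contains all reciprocal edges. If $(i,j) \in E$ and $(j,i) \in E$ then $(i,j) \in E_r$. If $(i,j) \in E$ and $(j,i) \notin E$ then $(i,j) \in E_d$.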

Def. 9.

(Reciprocal and Directed Parts of ) We linearly decompose a non-symmetric adjacency matrix into

where is the reciprocal part of and is the directed part of . Also, the undirected version of is .

For and , we have the respective decompositions, , and . Moreover, and have simple formulas in terms the respective parts of and ,

These equations could be used for general and . We take the approach of restricting (associated with an undirected graph) so every edge in is reciprocal to greatly simplify the Kronecker formulas we derive, as

In cases where the input data for the second factor is directed, we advocate using the undirected version for so the formulas remain simple.
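The split into reciprocal and directed parts, and its behaviour under a Kronecker product with an undirected second factor, can be sketched as follows. The example assumes 0/1 adjacency matrices, takes the reciprocal part as the Hadamard product of a matrix with its transpose, and uses hypothetical helper names (`split`, `kron`).

```python
def kron(A, B):
    """Kronecker product of two dense matrices stored as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def split(A):
    """Return (reciprocal part, directed part) of a 0/1 adjacency matrix."""
    n = len(A)
    recip = [[A[i][j] * A[j][i] for j in range(n)] for i in range(n)]
    direc = [[A[i][j] - recip[i][j] for j in range(n)] for i in range(n)]
    return recip, direc

# Directed factor: arcs 0->1 and 1->0 (reciprocal pair), 1->2 and 2->0 (directed).
A = [[0, 1, 0],
     [1, 0, 1],
     [1, 0, 0]]
# Undirected factor with a self loop on vertex 0.
B = [[1, 1],
     [1, 0]]

A_r, A_d = split(A)
C_r, C_d = split(kron(A, B))

reciprocal_ok = C_r == kron(A_r, B)   # reciprocal part of the product
directed_ok = C_d == kron(A_d, B)     # directed part of the product
print(reciprocal_ok, directed_ok)
```

Because every arc of the undirected factor has its reverse, an arc of the product is reciprocal exactly when its first-factor arc is reciprocal.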

IV-B Degree-Distribution

We have simple formulas for the standard in/out degree vectors, , ,

Under the directed/reciprocal edge model and assuming , we also have similar formula for reciprocal, directed-out, and directed-in degree vectors, , , and .

IV-C Directed Triangle Participation of Vertices

Fig. 4: Fifteen possible different combinations (removing symmetries) of directed triangles with reciprocal and directed edges from a vertex’s perspective. The triangle types are depicted by listing first the central vertex’s role in the triangle (is it a source ’s’, target ’t’, or undirected ’u’ with respect to the two incident edges). The last character represents the direction of the remaining edge, oriented (’+’ forward, ’-’ backward, or ’o’ undirected) in listed order.
Def. 10.

(Directed Triangle Participation at Vertices) Directed triangle participation of type at vertices, , is a vector that counts the number of directed triangles of type at each vertex (see Fig. 4). For directed with no self loops, we have

When has no reciprocal edges and has no self loops, we have a simple formula for each flavor of directed triangle in at vertex based on the count of the same flavor of triangle in and undirected triangle count with , at the associated vertices in ’s graph and in ’s graph.

Thm. 4.

(Directed Triangle Participation at Vertices) Let . Assume the right factor of is undirected, , and has no self-loops, . For every type of directed triangle , we have