On Large-Scale Graph Generation with Validation of Diverse Triangle Statistics at Edges and Vertices
Abstract
Researchers developing implementations of distributed graph analytic algorithms require graph generators that yield graphs sharing the challenging characteristics of real-world graphs (small-world, scale-free, heavy-tailed degree distribution) with efficiently calculable ground-truth solutions to the desired output. Reproducibility for current generators [1] used in benchmarking is somewhat lacking in this respect due to their randomness: the output of a desired graph analytic can only be compared to expected values and not exact ground truth. Non-stochastic Kronecker product graphs [2] meet these design criteria for several graph analytics. Here we show that many flavors of triangle participation can be cheaply calculated while generating a Kronecker product graph.
Given two medium-sized scale-free graphs with adjacency matrices $A$ and $B$, their Kronecker product graph has adjacency matrix $C = A \otimes B$. Such graphs are highly compressible: $|E_C|$ edges are represented in $O(|E_A| + |E_B|)$ memory and can be built in a distributed setting from small data structures, making them easy to share in compressed form. Many interesting graph calculations have worst-case complexity bounds $O(|E_C|^p)$ and often these are reduced to $O(|E_A|^p + |E_B|^p)$ for Kronecker product graphs, when a Kronecker formula can be derived yielding the sought calculation on $C$ in terms of related calculations on $A$ and $B$.
We focus on deriving formulas for triangle participation at vertices, $t_C$, a vector storing the number of triangles that every vertex is involved in, and triangle participation at edges, $\Delta_C$, a sparse matrix storing the number of triangles at every edge. When factors $A$ and $B$ are undirected, $C$ is also undirected. In the case when both factors have no self loops we show $t_C = 2\, t_A \otimes t_B$ and $\Delta_C = \Delta_A \otimes \Delta_B$. Moreover, we derive the respective formulas when $A$ and $B$ have self loops, which boosts the triangle counts for the associated vertices/edges in $C$. We additionally demonstrate strong assumptions on $B$ that allow the truss decomposition of $C$ to be derived cheaply from the truss decomposition of $A$.
We extend these results and show Kronecker formulas for triangle participation in both directed graphs and undirected, vertex-labeled graphs. In these classes of graphs each vertex or edge can participate in many different types of triangles.
I Introduction
In recent work [3], extremely large synthetic power-law Kronecker graphs [2] are generated in an essentially communication-free implementation for the primary purpose of validating graph calculation implementations on benchmarks where the answer is known exactly. The (non-stochastic) Kronecker approach leverages ideas and observations from previous work on stochastic Kronecker graph generators [4, 5, 6], with the added benefit that the ground truth of many local and global graph statistics is efficiently calculated during the generation process. In [7], several properties like graph diameter are discussed for both the non-stochastic and stochastic cases.
We form a graph whose adjacency matrix $C$ is a Kronecker product [8, 2, 9] of two much smaller factors,
$$C = A \otimes B,$$
as this framework provides the ability to calculate many (normally expensive) graph statistics for $C$ cheaply from associated statistics on $A$ and $B$. A polynomial-time graph calculation has the potential to be done at a square root of the general worst-case cost and in line with the graph generation process.
Consider triangle counting as an example. Suppose the number of edges in $A$ and $B$, $|E_A|$ and $|E_B|$, are both $O(m)$; then the number of edges in $C$ is $O(m^2)$. We want to compute the number of triangles in $C$, $\tau_C$. For a general graph of this size, computing $\tau_C$ has worst-case complexity $O(|E_C|^{3/2})$ [10], or $O(m^3)$ (and $\tau_C$ itself could be $O(m^3)$). However, the worst-case bound for counting triangles in a $C$ of this form, using the Kronecker formula $\tau_C = 6\, \tau_A \tau_B$, is $O(|E_A|^{3/2} + |E_B|^{3/2})$, or $O(m^{3/2})$; the number of triangles in a trillion-edge graph is computed sublinearly, even in the worst case. If $A$ and $B$ are sparse, real-world, power-law graphs, the actual complexity of an implementation leveraging heuristics can be as low as $O(m)$, which is often significantly lower than $O(m^{3/2})$ [11, 12], even in cases where $\tau_C$ is as high as possible for a graph of its size, $O(m^3)$.
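The triangle-counting example can be checked numerically on small factors. The sketch below is only an illustration: the names `random_undirected`, `triangle_count`, and the symbol `tau` are assumptions introduced here, and the identity used is that, for loop-free undirected factors, the total triangle count of the product is six times the product of the factor counts (which follows from trace(C^3) = trace(A^3) trace(B^3)).

```python
import numpy as np

def random_undirected(n, p, seed):
    """Hypothetical helper: symmetric 0/1 adjacency matrix with no self loops."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random((n, n)) < p, k=1).astype(np.int64)
    return upper + upper.T

def triangle_count(adj):
    """Total triangles in a loop-free undirected graph: trace(adj^3) / 6."""
    return int(np.trace(adj @ adj @ adj)) // 6

A = random_undirected(6, 0.6, seed=0)
B = random_undirected(5, 0.7, seed=1)
C = np.kron(A, B)                      # product graph on 6 * 5 = 30 vertices

tau_A, tau_B = triangle_count(A), triangle_count(B)
tau_C_direct = triangle_count(C)       # works on the full product graph
tau_C_kron = 6 * tau_A * tau_B         # touches only the two small factors
print(tau_C_direct, tau_C_kron)
```

The direct count multiplies matrices of the product's size, whereas the Kronecker count never forms the product at all, which is the cost gap the paragraph describes.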
This class of generators allows researchers to validate implementations of graph calculations that ignore the Kronecker framework on problems where the solution is known, and gain confidence in the implementations’ application to extremely large real-world graphs, where the solution is fundamentally unknown and the only hope of validation is the agreement between two or more competing implementations.
Rem. 1.
The stochastic Kronecker graphs from [4, 1] are demonstrated in [7, 13] to have relatively few triangles compared to large sparse real-world graphs. This is due to the independence of edges in the stochastic model and their extremely low combined probability for most vertex triplets. The non-stochastic Kronecker graphs we consider here are fundamentally different: they do not necessarily suffer from unreasonably low triangle counts and we give many cases throughout this paper where non-stochastic Kronecker graphs have a high number of triangles. Additionally, our formulas allow tuning of local triangle counts by adding/deleting triangles and self-loops from the input factors.
Triangle-related graph analysis is extremely important to many applications. In undirected graphs, triangle participation at vertices (or triangle degree) is a common expensive graph statistic for metrics like the local clustering coefficient of a vertex [14]. Similarly, triangle participation at edges is used for the clustering coefficient of an edge. Both types of participation are additionally used in several less-local graph analytics like improved clustering [15], truss decompositions [16, 17, 18, 19, 20], and realistic graph generation. Furthermore, in directed graphs or labeled graphs, diverse triangle statistics that count various types of triangles are calculable at every vertex and edge (see Figs 4, 5, and 6) and these statistics are useful for several interesting applications like motif-based clustering [21] or pattern detection [22]. All forms of triangle participation are likely attractive topological features in supervised/unsupervised machine learning applications [23, 24].
There is currently significant research effort towards developing algorithms and systems capable of computing large-scale triangle counting, participation, and enumeration. These efforts include implementations in MapReduce [25, 26], leveraging GPUs [27, 18, 20], in shared memory [28], utilizing linear algebraic kernels [29, 30, 31, 32], and several other graph HPC implementations [17, 12, 33, 34]. A recent workshop, IEEE HPEC 2017 Graph Challenge [18], was organized to accelerate the progress of these efforts via cross-collaboration.
In this paper, we derive several new Kronecker formulas for diverse triangle participation counts of all vertices and edges in several classes of graphs. Our contributions include:

formulas for triangle participation of all edges and vertices in an undirected Kronecker product graph, in cases where both factors have no self loops, where only one factor has self loops, or where both factors have self loops.

formulas for participation of all edges and vertices in the many types of directed triangles in a directed Kronecker product graph, in cases where one factor is directed (nonsymmetric) without self loops and the other factor is undirected (symmetric) and possibly contains self loops.

formulas for participation of all edges and vertices in the many types of vertex-labeled triangles in an undirected vertex-labeled Kronecker product graph, in cases where one factor is vertex-labeled without self loops and the other factor is unlabeled and possibly contains self loops.

Several implications of these formulas regarding properties of degree and triangle distributions for Kronecker product graphs.

A strategy for generating graphs with known truss decomposition.

Several simple examples for validating and checking these formulas.
II Preliminaries
Matrices formed by Kronecker products are block structured and we define some convenience functions to write the index maps compactly. For a block-structured array with block size , we define functions that, for a given global index , retrieve the block number, , and the intra-block index .
The inverse of is
in the sense that
Def. 1.
Prop. 1.
Def. 2.
Def. 3.
(Standard Matrix and Vector Objects) Given , is the matrix of all zeros and is the identity matrix, both with the same size of . Constant vectors , are the vector of all zeros, and the vector of all ones, both of dimension .
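The block-number and intra-block index maps introduced at the start of this section can be sketched in a few lines. This is a minimal illustration under stated assumptions: it uses 0-based indices (the paper's own convention may be 1-based), and the function names `block_number`, `intra_block_index`, and `global_index` are hypothetical. It also checks that every entry of a Kronecker product is the product of the two factor entries selected by these maps.

```python
import numpy as np

def block_number(i, block_size):
    """0-based block that global index i falls in."""
    return i // block_size

def intra_block_index(i, block_size):
    """0-based position of global index i inside its block."""
    return i % block_size

def global_index(block, offset, block_size):
    """Inverse map: rebuild the global index from (block, offset)."""
    return block * block_size + offset

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(3, 3))
B = rng.integers(0, 2, size=(4, 4))
C = np.kron(A, B)
m = B.shape[0]                         # block size of C

# Each entry of C is (entry of A at the block numbers) * (entry of B at the offsets).
entries_match = all(
    C[i, j] == A[block_number(i, m), block_number(j, m)]
    * B[intra_block_index(i, m), intra_block_index(j, m)]
    for i in range(C.shape[0])
    for j in range(C.shape[1])
)

# The inverse map recovers every global index.
round_trip = all(
    global_index(block_number(i, m), intra_block_index(i, m), m) == i
    for i in range(C.shape[0])
)
```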
We define some diagonal operators of square matrices in terms of the Hadamard product and recall several useful formulas regarding Hadamard products, as they simplify many derivations in the rest of the paper.
Def. 4.
(Matrix Diagonal Operators) Given , the matrix is the diagonal entries of . The diagonal operator is , a vector in .
Prop. 2.
(Properties of the Hadamard Product [9]) In the following, we implicitly assume that and whenever is present.

Commutativity. .

Scalar Multiplication. For any ,

Distributivity.

Transposition.

Hadamard-Kronecker Distributivity.

Diagonal-Kronecker Distributivity. When and ,
Proof.
(a)-(e) are standard properties. For (f),
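The two distributivity properties can be spot-checked numerically. The sketch below assumes they take their standard forms: the Hadamard product of two Kronecker products equals the Kronecker product of the Hadamard products, and the diagonal of a Kronecker product equals the Kronecker product of the diagonals. The second factor names `A2` and `B2` are introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A, A2 = rng.integers(-3, 4, size=(2, 3, 3))   # two 3x3 integer matrices
B, B2 = rng.integers(-3, 4, size=(2, 4, 4))   # two 4x4 integer matrices

# Hadamard-Kronecker: (A kron B) * (A2 kron B2) == (A * A2) kron (B * B2),
# where * is the entrywise (Hadamard) product.
lhs = np.kron(A, B) * np.kron(A2, B2)
rhs = np.kron(A * A2, B * B2)

# Diagonal-Kronecker: diag(A kron B) == diag(A) kron diag(B).
diag_lhs = np.diag(np.kron(A, B))
diag_rhs = np.kron(np.diag(A), np.diag(B))
```

Both identities are what let triangle statistics of a product, which are built from Hadamard products and diagonals of matrix powers, factor into statistics of the two small matrices.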
II-A Graph Notation
Let be a set of vertices and edges, pairwise relationships between members of of the form , where . We say is undirected if implies for every (and is directed if this fails for at least one edge). An edge of the form is a self loop.
Let . The matrix is an adjacency matrix representing if for each and for each . Given an adjacency matrix , we use , , and , to represent the associated graph, vertices, and edges, respectively. We use a subscript for many other symbols referring to properties of (e.g. ).
Rem. 2.
Edge incidence matrices and matrix reorderings are important constructs for the most efficient linear algebraic formulas for performing some actual graph computations [29, 30, 31, 32]. However, edge incidence matrices would greatly complicate the Kronecker formula derivations we present, due to their row ordering being arbitrary. Therefore, we avoid using them in the derivations presented in this work.
III Undirected Graphs
Let , , be two adjacency matrices (possibly with self loops) on and vertices, respectively. The matrix is an adjacency matrix (possibly with self loops) on vertices. We define index maps
so or . For diagonal elements, .
Rem. 3.
(Self-Loops) As observed in [7, 3], putting self loops into the factors of and boosts the number of triangles in significantly. Therefore we will analyze the case when factors have no self-loops (for simplicity) and cases when one or more of the factors have self loops. Also, removing all self loops from an adjacency matrix can be written in terms of the Hadamard product, , making Kronecker product formulas still fairly simple for many types of graph statistics. The diagonal operator containing only the self-loops, , is used in many of the following derivations.
Throughout this section we provide several formulas for computing exact graph statistics for that involve Kronecker products of associated graph statistics of and . The following example provides the reader with sanity checks of the formulas throughout the section.
Ex. 1.
(Cliques With and Without Self-Loops) Let and define the adjacency matrix of a clique of size as . Within the graph associated with , the degree of each vertex is , the number of triangles involving each vertex is , and the number of triangles involving each edge is .
Note that is the adjacency matrix of clique of size where every vertex has a self loop. We form three simple examples of Kronecker products involving and .

Ex. 1(a), no self loops. Let . Then the degree of each vertex is . The number of triangles involving each vertex is
The number of triangles involving each edge is

Ex. 1(b), self loops in second factor. Let . Then the degree of each vertex is . The number of triangles involving each vertex is
The number of triangles involving each edge is

Ex. 1(c), self loops in both factors. Let , which is equal to . Then the degree of each vertex is . The number of triangles involving each vertex is and the number of triangles involving each edge is .
Note that Ex. 1(c) demonstrates that can have as many triangles as possible, as it is a clique on vertices.
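The three clique products of Ex. 1 are small enough to check by brute force. The sketch below is an illustration with assumed sizes (a 4-clique and a 3-clique) and hypothetical helper names; triangles are counted after discarding self loops, so the expected values are the usual clique counts: a clique on n vertices has degree n - 1, (n - 1)(n - 2)/2 triangles per vertex, and n - 2 triangles per edge.

```python
import numpy as np

def clique(n, self_loops=False):
    """Adjacency matrix of a clique on n vertices, optionally with all self loops."""
    K = np.ones((n, n), dtype=np.int64) - np.eye(n, dtype=np.int64)
    return K + np.eye(n, dtype=np.int64) if self_loops else K

def vertex_triangles(adj):
    """Triangles at each vertex, ignoring self loops: diag(S^3) / 2."""
    S = adj - np.diag(np.diag(adj))
    return np.diag(S @ S @ S) // 2

def edge_triangles(adj):
    """Triangles at each edge, ignoring self loops: S entrywise-times S^2."""
    S = adj - np.diag(np.diag(adj))
    return S * (S @ S)

n, m = 4, 3
K_n = clique(n)
C_a = np.kron(clique(n), clique(m))              # (a) no self loops
C_b = np.kron(clique(n), clique(m, True))        # (b) self loops in second factor
C_c = np.kron(clique(n, True), clique(m, True))  # (c) self loops in both factors

for name, C in [("a", C_a), ("b", C_b), ("c", C_c)]:
    print(name, vertex_triangles(C)[0], edge_triangles(C).max())
```

With these sizes, case (b) is the complete 4-partite graph with parts of size 3, and case (c) is the all-ones matrix, i.e. a clique on 12 vertices with a self loop at every vertex.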
III-A Degree Distribution
As shown in [7],[3], it is simple to see the degree distribution vector of , , in terms of the degree distribution vectors of and . Without self loops in and , , and
Note that is definitely not a perfect power-law distribution (for one, no prime greater than max is possible). However, if and have power-law degree distributions (such as a Pareto distribution), we can estimate the tail behavior of ’s degree distribution to be heavy-tailed, as it is a multinomial of heavy-tailed distributions [7]. It is important to note, though, that the ratio of maximum degree to number of nodes is essentially squared,
With self-loops in only, , and with self loops in both factors,
The squaring of the ratio of maximum degree to the number of nodes is qualitatively different (unless or ).
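A quick numerical check of the loop-free degree relation is sketched below. It assumes degrees are taken as row sums of the adjacency matrix, in which case the degree vector of the product is the Kronecker product of the factor degree vectors, and the ratio of maximum degree to vertex count multiplies across factors. The helper `random_undirected` and the variable names are hypothetical.

```python
import numpy as np

def random_undirected(n, p, seed):
    """Hypothetical helper: symmetric 0/1 adjacency matrix with no self loops."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random((n, n)) < p, k=1).astype(np.int64)
    return upper + upper.T

A = random_undirected(7, 0.5, seed=10)
B = random_undirected(6, 0.5, seed=11)
C = np.kron(A, B)

d_A, d_B, d_C = A.sum(axis=1), B.sum(axis=1), C.sum(axis=1)
d_C_kron = np.kron(d_A, d_B)           # every degree in C is a product of factor degrees

ratio_C = d_C.max() / C.shape[0]
ratio_product = (d_A.max() / A.shape[0]) * (d_B.max() / B.shape[0])
```

Because every degree in the product is a product of two factor degrees, no prime larger than the largest factor degree can appear, which is the sense in which the product's degree distribution is not a perfect power law.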
III-B Triangle Participation of Vertices
Def. 5.
(Triangle Participation at Vertices) For adjacency matrix , triangle participation at vertices is represented by , a vector that counts the number of undirected triangles at each vertex. For undirected with self loops, we have
Note that when has no self loops, then .
Thm. 1.
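(Triangle Participation at Vertices) Let $C = A \otimes B$. Assume the factors of $C$ are undirected, $A = A^T$ and $B = B^T$, and have no self-loops, $\mathrm{diag}(A) = 0$ and $\mathrm{diag}(B) = 0$. Then the triangle participation of each vertex is given by
$$t_C = 2\, t_A \otimes t_B.$$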
Proof.
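Using the diag() operator, we have
$$t_C = \tfrac{1}{2}\,\mathrm{diag}(C^3) = \tfrac{1}{2}\,\mathrm{diag}(A^3 \otimes B^3) = \tfrac{1}{2}\,\mathrm{diag}(A^3) \otimes \mathrm{diag}(B^3) = 2\, t_A \otimes t_B.$$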
One can validate this formula for Ex. 1(a). Also, from it is easy to see that the total number of triangles in obeys . Notice that without self-loops in or there will always be an even number of triangles for every vertex in . More generally, if we allow to have self loops and to have none, then has no self loops. The formula for triangle participation is still very simple.
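The loop-free vertex formula, that each vertex of the product participates in twice the product of the corresponding factor counts, can be verified on random factors. This is a sketch only; the helper names and the symbol `t` for the participation vector are assumptions made for the illustration.

```python
import numpy as np

def random_undirected(n, p, seed):
    """Hypothetical helper: symmetric 0/1 adjacency matrix with no self loops."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random((n, n)) < p, k=1).astype(np.int64)
    return upper + upper.T

def vertex_triangles(adj):
    """Triangles at each vertex of a loop-free undirected graph: diag(adj^3) / 2."""
    return np.diag(adj @ adj @ adj) // 2

A = random_undirected(6, 0.6, seed=20)
B = random_undirected(5, 0.6, seed=21)
C = np.kron(A, B)

t_A, t_B = vertex_triangles(A), vertex_triangles(B)
t_C_direct = vertex_triangles(C)        # computed on the product graph
t_C_kron = 2 * np.kron(t_A, t_B)        # computed from the factors only

total_triangles_C = int(t_C_direct.sum()) // 3   # each triangle is seen at 3 vertices
```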
Cor. 1.
(Self Loops) Assume the factors of are undirected graphs, , has no self-loops, , but has self loops, . Then the triangle participation of each vertex is given by
Proof in Appendix.
Using , one can validate this formula for Ex. 1(b). Note that in the corollary above contains in its th entry double counts of triangles and the four other three-step sequences from vertex to itself that involve nontrivial edges and self loops. For example, if is connected to , counts the non-triangles and
In the fully general case, where and both have self loops, we have a more complicated formula. Let . Note that for any we have diag, and (a diagonal with 0 and 1 entries) to show
One can validate this formula for Ex. 1(c), using , , .
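The one-sided self-loop case can be checked numerically as well. The sketch below assumes the corollary is equivalent to the identity obtained directly from the diagonal-Kronecker property: when the first factor is loop-free, the product is loop-free and its vertex participation is the first factor's participation vector Kronecker-multiplied by the diagonal of the cube of the second factor (self loops included). Helper and variable names are hypothetical.

```python
import numpy as np

def random_undirected(n, p, seed):
    """Hypothetical helper: symmetric 0/1 adjacency matrix with no self loops."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random((n, n)) < p, k=1).astype(np.int64)
    return upper + upper.T

A = random_undirected(5, 0.6, seed=2)                           # no self loops
B = random_undirected(4, 0.6, seed=3) + np.diag([1, 0, 1, 1])   # some self loops
C = np.kron(A, B)                                               # inherits no self loops

t_A = np.diag(A @ A @ A) // 2
t_C_direct = np.diag(C @ C @ C) // 2
# diag(B^3) counts closed three-step walks in B, including those through self loops.
t_C_formula = np.kron(t_A, np.diag(B @ B @ B))
```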
III-C Triangle Participation of Edges
Def. 6.
(Triangle Participation at Edges)
Triangle participation at edges, is a matrix
whose th entry is the number of triangles in which edge participates. When has no self-loops, .
A useful formula from this definition is .
Thm. 2.
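(Triangle Participation at Edges) Let $C = A \otimes B$. Assume the factors of $C$ are undirected, $A = A^T$ and $B = B^T$, and have no self-loops, $\mathrm{diag}(A) = 0$ and $\mathrm{diag}(B) = 0$. The number of triangles in any edge within the graph of $C$ is
$$\Delta_C = \Delta_A \otimes \Delta_B.$$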
Proof in Appendix.
Cor. 2.
(Self Loops) Assume the factors of are undirected, , and has no self-loops, . The number of triangles in any edge within the graph of is
The proof is straightforward because , due to having no self loops. If and both have self loops, then employ , , , and to see that
Again, all three results in this section can be validated by applying the formulas to Ex. 1(a)(c).
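The edge-participation results can be validated the same way. This sketch assumes that, for a loop-free adjacency matrix, edge participation is the Hadamard product of the matrix with its square, and it checks two cases: both factors loop-free, and only the second factor carrying self loops. Names such as `delta_A` and `random_undirected` are introduced here for illustration.

```python
import numpy as np

def random_undirected(n, p, seed):
    """Hypothetical helper: symmetric 0/1 adjacency matrix with no self loops."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random((n, n)) < p, k=1).astype(np.int64)
    return upper + upper.T

def edge_triangles(adj):
    """Triangles at each edge of a loop-free graph: adj entrywise-times adj^2."""
    return adj * (adj @ adj)

A = random_undirected(5, 0.6, seed=30)
B = random_undirected(4, 0.7, seed=31)
delta_A, delta_B = edge_triangles(A), edge_triangles(B)

# Both factors loop-free.
C = np.kron(A, B)
delta_C_direct = edge_triangles(C)
delta_C_kron = np.kron(delta_A, delta_B)
t_C_from_edges = delta_C_direct.sum(axis=1) // 2   # row sums give twice the vertex counts

# Self loops only in the second factor: the product is still loop-free.
B_loops = B + np.eye(4, dtype=np.int64)
C2 = np.kron(A, B_loops)
delta_C2_direct = edge_triangles(C2)
delta_C2_kron = np.kron(delta_A, B_loops * (B_loops @ B_loops))
```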
III-D Truss Decomposition
We discuss deriving simple Kronecker formulas for the truss and the truss decomposition [16, 11] of and give a simple example showing this is difficult in general.
Def. 7.
(Truss and Truss Decomposition [16, 11]) A $k$-truss in $G$ is a nontrivial, one-component subgraph of $G$ such that each edge in the $k$-truss participates in at least $k-2$ triangles whose edges are all in the $k$-truss.
The truss decomposition of is the sequence of edge sets
for .
A simple (yet inefficient) algorithm gives the truss decomposition of $G$. Set $G' = G$. Repeat the following for $k = 3, 4, \ldots$, or until there are no more edges. Compute the triangle participation at edges of $G'$. Remove any edge that has fewer than $k-2$ triangles and update $G'$. Repeat these edge removal phases for fixed $k$, recomputing the triangle participation, removing edges, and updating $G'$ until no edges are removed. Then, set the edge set of the $k$-truss equal to all remaining edges in $G'$, increment $k$, and repeat edge removal phases until done.
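The peeling procedure just described can be sketched with dense matrices. This is an illustrative, deliberately inefficient version under stated assumptions: the threshold at level k is taken to be k - 2 triangles per edge (the usual truss convention), the one-component requirement of the definition is not enforced, and the function name `truss_decomposition` is hypothetical.

```python
import numpy as np

def truss_decomposition(adj):
    """Map each k >= 3 to the set of undirected edges (i, j), i < j, surviving level k."""
    S = np.array(adj, dtype=np.int64)
    np.fill_diagonal(S, 0)                     # self loops play no role here
    trusses = {}
    k = 3
    while S.any():
        while True:                            # edge removal phases for fixed k
            support = S * (S @ S)              # triangles per remaining edge
            weak = (S > 0) & (support < k - 2)
            if not weak.any():
                break
            S[weak] = 0                        # symmetric mask, so S stays symmetric
        if not S.any():
            break
        rows, cols = np.nonzero(np.triu(S))
        trusses[k] = {(int(i), int(j)) for i, j in zip(rows, cols)}
        k += 1
    return trusses

# A 4-cycle (vertices 1..4) with an added hub (vertex 0).
wheel = np.zeros((5, 5), dtype=np.int64)
for i, j in [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4), (4, 1)]:
    wheel[i, j] = wheel[j, i] = 1

wheel_trusses = truss_decomposition(wheel)
k4_trusses = truss_decomposition(np.ones((4, 4), dtype=np.int64))  # 4-clique plus loops
```

On the hub-and-cycle graph, the cycle edges fail the level-4 threshold first, which strips the hub edges of their triangles in the next phase, so only the level-3 set is non-empty.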
For , the formula (Thm. 2) seems useful for mapping the truss decompositions of and onto the truss decomposition of . The following example shows that a simple Kronecker formula is insufficient.
Ex. 2.
Let be a 4-cycle with an added hub, , a graph with 5 vertices, 8 edges, and 4 triangles (see Fig. 3). The cycle edges of (those not involving vertex 1) all participate in a single triangle, whereas the hub edges all participate in 2 triangles. All edges from the graph of are in the 3-truss, yet no edges are in the 4-truss.
Let , which has an associated graph with 25 vertices, 128 edges, and 96 triangles. Due to the Kronecker decomposition of , edges can be described as combinations of hub and cycle edges (see the right side of Fig. 3). Using the Kronecker formula (Thm. 2) for triangle participation of edges, we see there are 32 edges that participate in 1 triangle (cycle-cycle edges), 64 that participate in 2 triangles (hub-cycle and cycle-hub edges), and 32 that participate in 4 triangles (hub-hub edges). The graph of has 128 edges in the 3-truss, 80 edges in the 4-truss, and zero in the 5-truss, a more complicated structure than that of a simple Kronecker product.
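The vertex, edge, and triangle counts quoted in this example, along with the breakdown of edges by triangle participation, can be reproduced directly. The sketch below places the hub at index 0 (the text calls it vertex 1) and uses illustrative variable names; it does not attempt the truss counts.

```python
import numpy as np
from collections import Counter

# 4-cycle on vertices 1..4 with an added hub at vertex 0.
hub_edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
cycle_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
A = np.zeros((5, 5), dtype=np.int64)
for i, j in hub_edges + cycle_edges:
    A[i, j] = A[j, i] = 1

C = np.kron(A, A)                        # Kronecker product of the graph with itself
delta_A = A * (A @ A)                    # triangles per edge in the factor
delta_C = np.kron(delta_A, delta_A)      # Kronecker formula for the product

num_vertices = C.shape[0]
num_edges = int(C.sum()) // 2
num_triangles = int(np.trace(C @ C @ C)) // 6

# Histogram of triangle participation over undirected edges of C.
upper = np.triu(C, k=1) > 0
histogram = Counter(int(v) for v in delta_C[upper])
print(num_vertices, num_edges, num_triangles, dict(histogram))
```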
For a Kronecker formula for the truss decomposition of , either sophisticated assumptions on and need to be made or diagnostics of the intermediate phases of the truss decomposition algorithm need to be involved in the formula. In this work, we make fairly strong assumptions on one factor (edges of participate in at most one triangle) that imply a simple formula for the truss decomposition of . Note that requiring the edges of to participate in at most one triangle is a stronger assumption than being the only nontrivial set of the truss decomposition of .
Thm. 3.
(Truss Decomposition) Let . Assume the factors of are undirected, , have no self-loops, , and edges of participate in no more than one triangle, or . We have, if and only if
for
Proof.
If is zero, then is in no triangle and is in no truss. Let be the set of edges for which . Due to the strong assumptions on , the simple truss decomposition algorithm applied to the remaining edges in proceeds in lock step with that of in the sense that any edge is removed at phase if and only if is also removed at phase , as . ∎
Note that most real-world graphs do not satisfy the assumptions on made in the previous result. Until more sophisticated theory weakens these assumptions, we have two possible strategies for generating such that are scale-free.

Delete edges in a realworld graph until all edges participate in at most one triangle, while maintaining connectivity (with any spanning tree).

Use a simple graph generator (based on preferential attachment [35]) to yield power-law graphs where each edge participates in no more than one triangle. The generator starts with a single edge and proceeds as follows. For each new node , pick edge uniformly at random from the previously existing edges. Pick vertex from uniformly at random and add to the list of edges. If the number of triangles that participates in is zero, then let be the vertex in that wasn’t already attached, add to the list of edges, and increment the triangle count for and . Repeat for a new until the desired number of vertices is met.
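The second strategy, the edge-sampling generator, can be sketched as follows. This is one possible reading of the description rather than a definitive implementation: when a new vertex closes a triangle, the sketch marks all three edges of that triangle as having one triangle, and the names `generate_edges`, `triangles`, and `support` are assumptions. Sampling an endpoint of a uniformly chosen edge selects vertices in proportion to their degree, which is where the preferential attachment comes from.

```python
import random
import numpy as np

def generate_edges(num_vertices, seed=0):
    """Grow a graph in which every edge lies in at most one triangle."""
    rng = random.Random(seed)
    edges = [(0, 1)]                 # start from a single edge
    triangles = {(0, 1): 0}          # triangle count per edge, always 0 or 1
    for k in range(2, num_vertices):
        edge = rng.choice(edges)                              # uniform over edges
        i, j = edge if rng.random() < 0.5 else edge[::-1]     # uniform endpoint
        can_close = triangles[edge] == 0
        edges.append((i, k))
        triangles[(i, k)] = 0
        if can_close:
            # Attach k to the other endpoint too; (i, j, k) is the only new triangle
            # because k has no other neighbors.
            edges.append((j, k))
            for e in (edge, (i, k), (j, k)):
                triangles[e] = 1
    return edges, triangles

n = 200
edges, triangles = generate_edges(n, seed=7)

# Independent check of the bookkeeping using the adjacency matrix.
adj = np.zeros((n, n), dtype=np.int64)
for i, j in edges:
    adj[i, j] = adj[j, i] = 1
support = adj * (adj @ adj)          # triangles per edge
```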
IV Directed Graphs
IV-A Directed and Reciprocal Parts
We use the most general directed graph model, where directed edges and reciprocal edges are treated differently [36]. Under this model topological diversity is more locally detectable than under the standard model. For example, a vertex that has 3 reciprocal edges is very different from a vertex that has 6 disjoint edges (3 incoming and 3 outgoing).
Def. 8.
(Reciprocal and Directed Edges) Given a directed graph , we divide the edges in two disjoint sets, , where contains all directed edges and contains all reciprocal edges. If and then . If and then .
Def. 9.
(Reciprocal and Directed Parts of ) We linearly decompose a nonsymmetric adjacency matrix into
where is the reciprocal part of and is the directed part of . Also, the undirected version of is .
For and , we have the respective decompositions, , and . Moreover, and have simple formulas in terms of the respective parts of and ,
These equations could be used for general and . We take the approach of restricting (associated with an undirected graph) so every edge in is reciprocal to greatly simplify the Kronecker formulas we derive, as
In cases where the input data for the second factor is directed, we advocate using the undirected version for so the formulas remain simple.
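The simplification obtained from a symmetric second factor can be illustrated numerically. The sketch assumes 0/1 adjacency matrices, takes the reciprocal part to be the Hadamard product of a matrix with its transpose and the directed part to be the remainder, and forms the undirected version as the reciprocal part plus the directed part and its transpose. Under those assumptions, both parts of the product are the corresponding part of the first factor Kronecker-multiplied by the second factor. Names such as `split_parts` are hypothetical.

```python
import numpy as np

def split_parts(adj):
    """Return (reciprocal part, directed part) of a 0/1 adjacency matrix."""
    reciprocal = adj * adj.T
    return reciprocal, adj - reciprocal

rng = np.random.default_rng(4)
A = (rng.random((5, 5)) < 0.4).astype(np.int64)
np.fill_diagonal(A, 0)                            # directed factor, no self loops
upper = np.triu(rng.random((4, 4)) < 0.6).astype(np.int64)
B = np.maximum(upper, upper.T)                    # symmetric factor, self loops allowed

A_r, A_d = split_parts(A)
A_u = A_r + A_d + A_d.T                           # undirected version of A

C = np.kron(A, B)
C_r, C_d = split_parts(C)
```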
IV-B Degree Distribution
We have simple formulas for the standard in/out degree vectors, , ,
Under the directed/reciprocal edge model and assuming , we also have similar formulas for reciprocal, directed-out, and directed-in degree vectors, , , and .
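The degree relations for the directed case can be checked in the same spirit. The sketch assumes that out-degrees are row sums, in-degrees are column sums, and that the second factor is symmetric with degrees taken as row sums (a self loop counts once); each degree vector of the product is then the Kronecker product of the matching degree vector of the first factor with the degree vector of the second. All variable names are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
A = (rng.random((6, 6)) < 0.4).astype(np.int64)
np.fill_diagonal(A, 0)                            # directed factor, no self loops
upper = np.triu(rng.random((4, 4)) < 0.6).astype(np.int64)
B = np.maximum(upper, upper.T)                    # symmetric factor
C = np.kron(A, B)

def degree_vectors(adj):
    """Out, in, reciprocal, directed-out, and directed-in degree vectors."""
    reciprocal = adj * adj.T
    directed = adj - reciprocal
    return {
        "out": adj.sum(axis=1),
        "in": adj.sum(axis=0),
        "reciprocal": reciprocal.sum(axis=1),
        "directed_out": directed.sum(axis=1),
        "directed_in": directed.sum(axis=0),
    }

d_B = B.sum(axis=1)
deg_A, deg_C = degree_vectors(A), degree_vectors(C)
kron_predictions = {name: np.kron(vec, d_B) for name, vec in deg_A.items()}
```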
IV-C Directed Triangle Participation of Vertices
Def. 10.
(Directed Triangle Participation at Vertices) Directed triangle participation of type at vertices, , is a vector that counts the number of directed triangles of type at each vertex (see Fig. 4). For directed with no self loops, we have
When has no reciprocal edges and has no self loops, we have a simple formula for each flavor of directed triangle in at vertex based on the count of the same flavor of triangle in and undirected triangle count with , at the associated vertices in ’s graph and in ’s graph.
Thm. 4.
(Directed Triangle Participation at Vertices) Let . Assume the right factor of is undirected, , and has no self-loops, . For every type of directed triangle , we have