General-Purpose Join Algorithms for Listing Triangles in Large Graphs



We investigate applying general-purpose join algorithms to the triangle listing problem in an out-of-core context. In particular, we focus on Leapfrog Triejoin (LFTJ) by Veldhuizen [36], a recently proposed, worst-case optimal algorithm. We present “boxing”: a novel, yet conceptually simple, approach for feeding input data to LFTJ. Our extensive analysis shows that this approach is I/O efficient, being worst-case optimal (in a certain sense). Furthermore, if input data is only a constant factor larger than the available memory, then a boxed LFTJ essentially maintains the CPU data-complexity of the vanilla LFTJ. Next, focusing on LFTJ applied to the triangle query, we show that for many graphs boxed LFTJ matches the I/O complexity of MGT [10], the specialized algorithm recently proposed by Hu, Tao and Yufei for listing triangles in an out-of-core setting. We also strengthen the analysis of LFTJ’s computational complexity for the triangle query by considering families of input graphs that are characterized not only by the number of edges but also by a measure of their density. E.g., we show that LFTJ achieves a CPU complexity of for planar graphs, while on general graphs, no algorithm can be faster than . Finally, we perform an experimental evaluation for the triangle listing problem confirming our theoretical results and showing the overall effectiveness of our approach. On all our real-world and synthetic data sets (some of which contain more than 1.2 billion edges) LFTJ in single-threaded mode is within a factor of of the specialized MGT, a penalty that, as we demonstrate, can be alleviated by parallelization.


Hu, Tao, and Yufei [10] recently proposed a novel algorithm (MGT) for listing triangles in large graphs that is both I/O and CPU efficient and that outperforms existing competitors by an order of magnitude. At the same time, there has been exciting theoretical research showing that it is possible to design so-called worst-case optimal join algorithms [3]. This raises the question: How would general-purpose join algorithms compare to the best specialized triangle-listing algorithms in a setting where not all data fits into main memory?

This question is motivated by the desire to build general-purpose systems that empower their (domain) users to pose and run queries in a declarative and general language, such as SQL or Datalog, a goal that is likely uncontroversial. We focus on the out-of-core setting not only for the obvious reason that input or intermediary data may not fit in main memory, but also because we would like to utilize graphics processing units (GPUs) as high-throughput co-processors during query evaluation. GPU memory is currently limited to around 12GB [40], highlighting the urgency for robust out-of-core techniques.

Triangle listing is a basic building block for many other graph algorithms and a key ingredient for graph metrics such as triangular clustering, finding cohesive subgraphs, etc. [10]. In addition, it has received extensive attention in the research literature across several fields: graph theory, databases and network analysis, to name a few. Here, both in-memory and out-of-core algorithms have been studied. A general-purpose technique that can compete with the best-in-class hand-crafted algorithms specific to triangle listing would indeed be very good news for the database community advocating high-level, declarative query languages.

We selected Leapfrog-Triejoin (LFTJ) by Veldhuizen as the general-purpose join algorithm for our study. This is for various reasons: (1) its elegance allows efficient implementations with various optimizations, (2) by nature, LFTJ only uses intermediary data, making it a very good candidate in the out-of-core context, and (3) it has strong theoretical worst-case guarantees [36]. LFTJ’s worst-case guarantee in its generality is technical [36]. Roughly, it guarantees that for a given query and input , LFTJ will never perform asymptotically more steps (up to a log-factor) than are strictly necessary for any correct algorithm on inputs that are similar to . Here, similar means that, e.g., the sizes of the input relations cannot change nor can certain other statistics of the data.

Model & Assumptions. We restrict our attention to full-conjunctive queries, and use a Datalog syntax and terminology to describe queries (or joins). Our formal setting is the standard one for considering I/O efficient algorithms: Input, intermediary and output data can exceed the amount of available main memory (measured in words to store one atomic value), in which case it can be read (written) from (to) secondary storage with the granularity of a block that has size . Reading or writing a block incurs 1 unit of I/O cost. For I/O and CPU cost, we consider data complexities, that is, we assume the query to be fixed and small. In particular, we would like to be larger than, say, 10 times the number of atoms multiplied by their maximum arity. Furthermore, to simplify complexity results, we assume that is larger than . This restriction is mostly theoretical: Using a block size of 64KiB with a 64-bit word-width, inputs only need to be larger than 15MiB to satisfy the requirement. With these assumptions in mind, we make the following contributions:


Boxing LFTJ. We present and analyze a novel strategy we call boxing for out-of-core execution of a multi-predicate, worst-case optimal join algorithm (Leapfrog-Triejoin). This method exhibits the following properties:


For queries with variables, executing on input data and producing output data of size , boxed LFTJ requires at most block I/Os. We show that this bound is worst-case optimal, in the sense that for any , we can construct a query such that no algorithm can have an asymptotically better bound with respect to and .


We further show that if the input data exhibits limited skew (in a sense we will make precise) then boxed LFTJ requires only I/Os. Here, denotes the rank of the query, a property we will define. The rank of a query never exceeds the number of variables used in the Datalog body, and is often lower.


We also analyze the computational complexity of boxed . Here, we show that if the input size is only a constant factor larger than the available memory , then the asymptotic CPU work performed by the boxed (essentially) matches the asymptotic complexity of the in-memory , maintaining its theoretical guarantees.

Boxed . We apply boxing to the triangle-listing problem. Here, the input graph exhibits limited-skew if the degree of its nodes is limited by . With 100GiB of main memory this allows graphs containing nodes with up to 1.3 million neighbors.

On such graphs, our approach requires block I/Os, matching the asymptotic I/O bound of the recently presented specialized algorithm MGT [10] for triangle listing.

In-memory . We also tighten the analysis for the CPU complexity of the conventional in-memory LFTJ applied to the triangle listing query with non-trivial arguments. It is easy to see that ’s achieved asymptotic complexity of is worst-case optimal modulo the log-factor. We improve on this result in two ways:


We show that for graphs with an arboricity , requires work. A graph’s arboricity is a measure of its density (as we will explain later) which never exceeds . Moreover, is substantially smaller for many graphs [7]; for example, for both planar graphs and graphs with fixed maximum degree, . As a corollary, we thus obtain that runs in on planar graphs.


We further improve on the worst-case optimality analysis: We show that even if we are only interested in families of graphs for which their arboricity is limited by any function , e.g. by , and we would like to design a specialized algorithm that (only) works (well) on these graphs, then this algorithm cannot have an asymptotic complexity that is in . This result shows that is worst-case optimal for any of these families (modulo the log-factor).

Evaluation. We further present an experimental evaluation, where we focus on the triangle query. We confirm that the boxing technique works well, especially when the input data is only a constant factor larger than the available memory: on real-world and synthetic graphs with each more than 1.2 billion nodes, boxing introduces only little CPU overhead and has good performance even when only limited main memory is available. We also compare the raw performance against two competitors: a specialized [32] C++-based implementation in the graph-processing system Graphlab [16] and the specialized triangle listing algorithm MGT [10]. LFTJ is about 65% the Graphlab implementation, yet scales to larger data sizes. When running single-threaded, LFTJ is on average 3x slower than MGT. Our parallelized version of LFTJ, however, is slightly faster than the single-threaded MGT (about 30% main memory is restricted to as much as 10%

The rest of the paper is structured as follows: Section 2 reviews the relevant background information. We present and analyze the boxing strategy for LFTJ in Section 3. Section 4 analyzes the in-memory and the boxed variant of LFTJ applied to the triangle query. Section 5 highlights some important aspects of our implementation, before we experimentally evaluate our approaches in Section 6. We review related work in Section 7 and conclude in Section 8.


2.1 Review: Leapfrog-Triejoin (LFTJ)

LFTJ [36] is a multi-predicate join algorithm. Unlike traditional binary join algorithms such as Hash-Join or Sort-Merge-Join which take two relations as input, LFTJ takes as input relations together with the join conditions.

Trie representation for relations.

It is convenient to think of relations as being represented as Tries. A Trie is a tree that stores each tuple of a relation as a path from the root node to a leaf node. See Fig. ? for an example of a ternary relation with its trie in Fig. ?. In general, a Trie for a relation with arity has a height of . For a relation , the nodes at height store values from the th column of . We require that children of the same node are unique and ordered increasingly. For example in Fig. ? at level 2, the children of b are the values u, v, and w, which are in increasing order.

TrieIterators. LFTJ accesses relational data not directly but via a TrieIterator interface. This not only allows various storage schemes but also facilitates uniform handling of “infinite” predicates such as Equal, SmallerThan or Plus. The TrieIterator interface provides methods to navigate the Trie of a relation. It can be thought of as a pointer to a node in the Trie. The detailed methods for Trie navigation are given in Appendix A.1. The methods are to access a data value; and to move up and down in the trie. The linear iterator methods , , and are used to move within unary “sub-relations” such as or . Here, moves one step right and is used to forward-position the iterator to the element with value ; if is not in , then the iterator is placed at the element with the smallest value . In general, if the iterator passes the end of the represented relation such as , the will return true. A key to good LFTJ performance is that back-end data structures efficiently support these TrieIterator operations. In fact, the theoretical guarantees given by LFTJ require that , , have complexity . Furthermore, and must not take longer than individually and must have an amortized cost of at most if keys are visited. Here, stands for the size of the unary relation the iterator is for; e.g., .

Leapfrog Join. A basic building block of Leapfrog Triejoin (LFTJ) is Leapfrog join (LFJ). It computes the intersection of multiple unary relations. For this, LFJ has a linear iterator for each of its input relations. An execution of LFJ is reminiscent of the merge-phase of a merge-sort; however instead of returning values that are in any of the inputs, we search and return values that are in all input relations. To do so efficiently, we use to iteratively advance the iterator positioned at the relation with the smallest value to the largest value amongst the iterators. If all iterators are placed on the same value, we have found a value of the intersection.
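The following is a minimal sketch of the leapfrog join described above, assuming each unary relation is given as a sorted, duplicate-free Python list. The function name `leapfrog_join` and the use of `bisect_left` as the seek operation are illustrative choices for this sketch, not the paper's implementation.

```python
from bisect import bisect_left

def leapfrog_join(relations):
    """Intersect sorted, duplicate-free lists (hypothetical helper for illustration)."""
    if not relations or any(len(r) == 0 for r in relations):
        return []
    k = len(relations)
    pos = [0] * k                          # one linear iterator per relation
    # Arrange the iterators by their current key, smallest first.
    order = sorted(range(k), key=lambda i: relations[i][0])
    max_key = relations[order[-1]][0]      # largest key among all iterators
    result = []
    p = 0                                  # order[p] is the iterator with the smallest key
    while True:
        i = order[p]
        key = relations[i][pos[i]]
        if key == max_key:
            # Smallest key equals largest key: all iterators agree on this value.
            result.append(key)
            pos[i] += 1                    # plays the role of next()
        else:
            # Plays the role of seek(max_key): least position with value >= max_key.
            pos[i] = bisect_left(relations[i], max_key, pos[i])
        if pos[i] == len(relations[i]):
            return result                  # iterator is at its end
        max_key = relations[i][pos[i]]
        p = (p + 1) % k

print(leapfrog_join([[0, 1, 3, 4, 5, 6, 7, 8, 9, 11], [0, 2, 6, 7, 8, 9], [2, 4, 5, 8, 10]]))
```

A plain binary search is used for seeking here; it does not by itself give the amortized bound required of the TrieIterator operations.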

Using LFJ to join relations with and denoting the cardinalities of the smallest and largest relation, respectively, has the following complexity:

The detailed algorithms for the Leapfrog join as well as LFTJ are given in Appendix A for reference; for an even more detailed introduction and reference see [36].

Leapfrog TrieJoin Restrictions. LFTJ requires that no variable occurs more than once in a single body atom. This can be achieved via simple rewrites: Given a join with, e.g., the atom in the body, we introduce a new variable and replace by where is the infinite equal-relation which itself is represented by a specialized TrieIterator.

As mentioned above, LFTJ is parameterized by an order on the variables of the join. This order is usually chosen by an optimizer as the exact order might influence runtime characteristics and can have an effect on the theoretical bounds for the I/O complexity as we will detail below. Furthermore, the chosen order determines the sort-order of the input relations: In particular, arguments in atoms of the join body must form a subsequence of the chosen order. E.g., consider the order : body atoms or are allowed while the atom needs to be replaced by an alternative index which is created as . These indexes are created in a pre-processing step.


We use a simple array-encoding for Tries, which is inspired by the Compressed-Sparse-Row (CSR) format—a commonly used format to store graphs. As an example see Fig. ? for the representation of the trie given in Fig. ?. The data values are stored in flat arrays called value-arrays. Index arrays are used to separate children at the same tree level but from different parent nodes. An -ary relation has value arrays and index arrays. In particular, the children of a node stored in the value array at position are stored in the array starting at the index from until the index inclusively. E.g. in Fig. ?, the children of from are stored in from to .

To reduce notation, we will often simply identify a relation with its TrieArray representation and vice versa in the rest of the paper. For example, when we write a -ary TrieArray we mean a TrieArray for an -ary relation .

All TrieIterator operations are trivial to implement for TrieArrays, except possibly seek, where some attention needs to be given to achieve the required amortized complexity. Here, instead of starting the binary search in the middle of the remaining sub-array, we probe with an exponentially increasing lookup sequence of e.g., to narrow lower and upper bounds for the binary search.
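A seek with exponentially increasing probes can be sketched as follows. This is an illustration under the assumption that the unary sub-relation is a sorted Python list; the name `gallop_seek` and the exact probe offsets are choices made for this sketch.

```python
from bisect import bisect_left

def gallop_seek(values, start, target):
    """Return the least index i >= start with values[i] >= target, or len(values) if none."""
    n = len(values)
    if start >= n or values[start] >= target:
        return start
    # Invariant: values[lo] < target.
    lo, step = start, 1
    while lo + step < n and values[lo + step] < target:
        lo += step
        step *= 2                      # probe distances grow exponentially
    hi = min(lo + step, n)
    # The answer lies in (lo, hi]; finish with a binary search on that window.
    return bisect_left(values, target, lo + 1, hi)

values = [1, 3, 5, 7, 9, 11, 13]
print(gallop_seek(values, 0, 8))
```

The cost of one call is logarithmic in the distance between `start` and the returned position, which is what makes a sequence of forward seeks cheap when few keys are skipped.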

While the TrieArray representation is beneficial for execution, it is also fairly cheap to create:

Sketch. The space requirement is obvious; furthermore the data structure can be built from a lexicographically sorted in two passes: pass 1 determines the sizes of the value and index arrays, pass 2 fills in data.
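To illustrate the two-pass construction, here is a small sketch for a binary relation, assuming the input is a lexicographically sorted, duplicate-free list of pairs. The names `build_trie_array`, `val0`, `idx0`, `val1` and `children` are hypothetical and chosen only for this example.

```python
def build_trie_array(sorted_pairs):
    """Build (val0, idx0, val1) from lexicographically sorted, duplicate-free pairs."""
    # Pass 1: determine the sizes of the value and index arrays.
    n_keys, prev = 0, object()
    for a, _ in sorted_pairs:
        if a != prev:
            n_keys, prev = n_keys + 1, a
    val0 = [None] * n_keys                 # distinct first-column values
    idx0 = [0] * (n_keys + 1)              # start of each child range in val1
    val1 = [None] * len(sorted_pairs)      # second-column values
    # Pass 2: fill in the data.
    k = -1
    for j, (a, b) in enumerate(sorted_pairs):
        if k < 0 or val0[k] != a:
            k += 1
            val0[k] = a
            idx0[k] = j
        val1[j] = b
    idx0[n_keys] = len(sorted_pairs)
    return val0, idx0, val1

def children(trie, p):
    """Children of the level-0 node stored at position p."""
    val0, idx0, val1 = trie
    return val1[idx0[p]:idx0[p + 1]]

trie = build_trie_array([(1, 2), (1, 3), (2, 3), (4, 1), (4, 5), (4, 6)])
print(trie, children(trie, 2))
```

For higher arities the same idea would be repeated per level, with one additional index array between each pair of adjacent value arrays.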

2.3 LFTJ for Computing Triangles

Given a simple, undirected graph and let be its directed version, that is for each edge , contains the pair . The query


3 Boxing LFTJ

We first motivate our strategy by showing that LFTJ can suffer from excessive I/O operations in an external-memory setting with a block-based least-recently-used memory replacement strategy. As example, we use the triangle query with specifically crafted input graphs.

LFTJ on the triangle query. It is useful to highlight the steps that LFTJ performs for the triangle query . These are summarized in Algorithm ?. Note that Algorithm ? is not the pseudo-code of the program we use to list triangles; it only summarizes the steps LFTJ performs when run on the triangle query. First, the leapfrog join at level for the atoms and computes the intersection between and . Then, for each found value for , we perform a leapfrog-join at level computing the intersection of with , because the variable occurs in the atoms and . In the last step, we find bindings for by intersecting with because occurs in the atoms and .
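The three levels can be sketched in a few lines of Python. This sketch assumes that the directed edge relation stores each undirected edge once as the pair (smaller endpoint, larger endpoint), and it uses a plain merge intersection in place of the leapfrog join; `intersect_sorted` and `list_triangles` are hypothetical names.

```python
def intersect_sorted(a, b):
    """Merge-style intersection of two sorted, duplicate-free lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def list_triangles(edges):
    # Assumption: keep each undirected edge once, oriented from min to max.
    adj = {}
    for a, b in edges:
        if a != b:
            adj.setdefault(min(a, b), set()).add(max(a, b))
    adj = {v: sorted(ns) for v, ns in adj.items()}
    firsts = sorted(adj)                                   # first column of the edge relation
    triangles = []
    for x in firsts:                                       # level x
        for y in intersect_sorted(adj[x], firsts):         # level y: neighbors of x that have out-edges
            for z in intersect_sorted(adj[x], adj[y]):     # level z: common neighbors of x and y
                triangles.append((x, y, z))
    return triangles

print(list_triangles([(1, 2), (2, 3), (1, 3), (3, 4), (2, 4), (5, 1)]))
```

Each triangle is reported exactly once because the orientation forces x < y < z.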

Example inputs causing excessive I/O. For , consider the graph with edges as:

where is slightly larger than the number of blocks fitting into main memory at once. See Fig. ? in the appendix for an example with , , , and . The key idea is that we place values in the second column of by apart which will cause LFTJ to perform an I/O for every tuple in for step 3 in Algorithm ?; furthermore, we make sure values in the second column repeat in groups large enough that loading all blocks in a group will evict the first block from memory, effectively preventing the algorithm from reusing the earlier loaded blocks.

See Appendix B.1.

3.1 High-Level Idea

We now describe our out-of-core adaptation for LFTJ. LFTJ with a variable order computes the join by essentially searching over an dimensional space in which each dimension spans over the domain of the variable . Loosely speaking, the space is searched in lexicographical order. As the example above demonstrates, this can lead to excessive I/O costs. Further I/O accesses are caused by the potentially non-local accesses for the binary searches of leapfrog-join.

In our approach, we partition the -dimensional search space into “hyper-cubes” or boxes such that the required data for an individual box fits into memory. LFTJ is then run over each box individually—finding all input data ready in memory. We strive for the following properties: (i) Determining box-boundaries is efficient: both in CPU and I/O work. (ii) Loading data that is restricted to a box is efficient, again, both in terms of CPU and I/O work. (iii) The total amount of data loaded is minimal.

Fig. ? illustrates this strategy for . The join uses three variables , , , resulting in a 3-dimensional search space. If the input graph represented via a TrieArray does not fit into the available memory, then we partition the search space into boxes, for example as in Fig. ?. The partitioning is chosen such that the input data restricted to an individual box fits into memory. is then run for each box individually one after another while join results are written append-only in a streaming fashion.

We now explain the different aspects in detail.

3.2 TrieArray Slices

We assume that input data is given on external storage in a TrieArray representation, with the attribute order consistent with the chosen key order for LFTJ. This can easily be achieved via a pre-processing step that costs block I/O and CPU steps. When loading data for a single box into main memory, we directly operate on the TrieArray representation to subset the data. The remainder of this subsection shows that this step can be done very efficiently.

In general, applying any selection to a TrieArray for a predicate to obtain a TrieArray for can be done in CPU work and I/Os if can be computed in time and space for tuples . This is because TrieArrays can be used to efficiently enumerate the represented tuples in lexicographical order, and they can also efficiently be built from lexicographically sorted tuples.

We are interested in certain range-based selections. It turns out that these can be built even faster, with costs proportional to the selected size rather than the total data set size (modulo log-factor), or even less depending on the cost-model.

In general, for an -ary relation , we are interested in creating slices at a level , . At level the values are restricted to an interval given by a low-bound and a high-bound ; at levels , the slice contains only a single element each. Formally:

We create and store Slices in the TrieArraySlice data structure, which is a conventional TrieArray—except that the index arrays can be parameterized with an offset to perform dynamic index-adaptation as explained in the example above. As with TrieArrays, we identify the Slice (set of tuples) with the TrieArraySlice data structure and vice versa in the rest of the paper.

Given a relation on secondary storage, we can create slices of efficiently:

Sketch The provisioning process is as follows: using binary searches on the value arrays , we locate the prefix in ; the slice is empty if the prefix does not exist. Then, using two more binary searches we locate the smallest element and the largest in of . Their positions are the boundaries in and for the interval we copy into the slice. For the remaining value arrays and index arrays, we iteratively follow the pointers within the arrays and copy the appropriate ranges. As a last step we adjust the index-array’s offset parameter: for each , we set the offset parameter of to .

We require I/Os for the binary searches and I/Os for copying the continuous values from the arrays with indexes . Similarly, the binary searches require CPU work; the remaining CPU work accounts for requesting the copy operations.

Note that besides the logarithmic component, provisioning a slice amounts to simply copying large, continuous arrays from secondary storage into main memory. On modern hardware, these can be done using DMA methods without causing any significant CPU work. Moreover, modern kernels might simply memory map the to-be-copied pages and perform actual copies only when pages are modified.
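The provisioning steps for the simplest case, a slice at level 0 of a binary relation, can be sketched as follows. The sketch assumes the hypothetical array layout `(val0, idx0, val1)` with `idx0` holding one more entry than `val0`; the dictionary fields and function names are illustrative only.

```python
from bisect import bisect_left, bisect_right

def slice_level0(trie, low, high):
    """Copy the part of a binary TrieArray whose first-column values lie in [low, high]."""
    val0, idx0, val1 = trie
    p = bisect_left(val0, low)       # first position with value >= low
    q = bisect_right(val0, high)     # one past the last position with value <= high
    offset = idx0[p]                 # where the copied child range starts in the full val1
    return {
        "val0": val0[p:q],
        "idx0": idx0[p:q + 1],       # copied unchanged; adapted on access via the offset
        "val1": val1[offset:idx0[q]],
        "offset": offset,
    }

def slice_children(s, p):
    """Children of the slice's level-0 node at position p, using the stored offset."""
    start = s["idx0"][p] - s["offset"]
    end = s["idx0"][p + 1] - s["offset"]
    return s["val1"][start:end]

trie = ([1, 2, 4], [0, 2, 3, 6], [2, 3, 3, 1, 5, 6])
s = slice_level0(trie, 2, 4)
print(s, slice_children(s, 1))
```

Keeping the index entries unchanged and subtracting an offset on access is what lets the copy be a contiguous block transfer instead of a rewrite of every index entry.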

Probing. As the last building block, we are interested in provisioning slices that will fill up a certain budgeted amount of memory. In particular, we specify the prefix-tuple and lower bound as before. But instead of providing an upper bound , we give a memory budget in blocks as shown in Fig. ?. We are then interested in a maximal upper bound such that the slice at from to requires no more than blocks of memory. Note that for skewed data, it is possible that the slice requires more than blocks of memory, even when . Should this case occur, we report via the sentinel value SPILL instead of returning an upper bound . Not surprisingly, probing is also efficient:

Sketch Similar to slice provisioning, except that we do a binary search for the upper bound and check for each guess how many blocks the TrieSlice would occupy. This can be done by following the pointers. Determining the size of the TrieSlice for each guess requires at most I/Os where is the arity of . Since we binary search in , an array that is at most size , we obtain the required complexity of .
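A probing step at level 0 of a binary relation might look like the sketch below. It assumes the hypothetical layout `(val0, idx0, val1)`, measures the budget in words rather than blocks for simplicity, and returns the largest stored value that still fits; `SPILL`, `slice_words` and `probe` are names introduced for this illustration.

```python
from bisect import bisect_left

SPILL = object()   # sentinel: even a single first-column value exceeds the budget

def slice_words(idx0, p, q):
    """Words needed for level-0 positions p..q-1: values, index entries and children."""
    return (q - p) + (q - p + 1) + (idx0[q] - idx0[p])

def probe(trie, low, budget_words):
    val0, idx0, _ = trie
    p = bisect_left(val0, low)
    if p == len(val0):
        return None                        # nothing stored at or above `low`
    if slice_words(idx0, p, p + 1) > budget_words:
        return SPILL
    lo, hi = p + 1, len(val0)              # the slice size grows with q, and q = lo fits
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if slice_words(idx0, p, mid) <= budget_words:
            lo = mid
        else:
            hi = mid - 1
    return val0[lo - 1]

trie = ([1, 2, 4], [0, 2, 3, 6], [2, 3, 3, 1, 5, 6])
print(probe(trie, 1, 8), probe(trie, 1, 4) is SPILL)
```

For a binary relation the size of a candidate slice is read off two index entries; for higher arities each guess would follow the pointers through the remaining index arrays.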

3.3 Boxing Procedure

To help exposition, we first describe aspects of the boxing approach via examples, before we cover the general case.

Joins with one variable. Consider a join over multiple unary relations such as

Imagine each of the body relations is larger than the available internal memory . We can divide the internal memory into four parts, one for the output data and one for each of the input relations. Since the output is written append-only, a relatively small portion of memory, which is written to disk once it fills up, is sufficient. We thus divide up the bulk of the memory for the three input relations. We can use the simple strategy to evenly divide the space. A boxed LFTJ execution would then simply alternate probing, provisioning, and calling LFTJ as described in Fig. ?.

Not surprisingly, this approach would work well for this limited class of joins: for reading the input, it requires a number of I/Os bounded by with being the combined size of the input relations. The key observation for showing the bound is that in each iteration (except possibly the last), at least one relation will load (in our example around ) tuples using block reads. Now, since there are only tuples in the input, there are at most iterations. Since each probing can be done in we obtain the desired bound.
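The alternation of probing, provisioning and joining can be sketched for unary relations as follows. The sketch assumes sorted, duplicate-free Python lists standing in for relations on secondary storage, a per-relation budget counted in tuples, and a set intersection standing in for the in-memory LFTJ call; `boxed_unary_join` is a hypothetical name.

```python
from bisect import bisect_right

def boxed_unary_join(relations, budget):
    """Intersect sorted lists while holding at most `budget` values of each in a box."""
    starts = [0] * len(relations)
    result, boxes = [], 0
    while all(s < len(r) for s, r in zip(starts, relations)):
        # Probe: the largest value each relation could load within its budget.
        highs = [r[min(s + budget, len(r)) - 1] for s, r in zip(starts, relations)]
        high = min(highs)
        # Provision: take the slice of every relation up to `high`.
        ends = [bisect_right(r, high, s) for s, r in zip(starts, relations)]
        slices = [r[s:e] for r, s, e in zip(relations, starts, ends)]
        # In-memory join over this box (stand-in for LFTJ).
        common = set(slices[0]).intersection(*slices[1:])
        result.extend(sorted(common))
        starts = ends                      # the next box starts right after `high`
        boxes += 1
    return result, boxes

A = list(range(0, 20, 2))
B = list(range(0, 20, 3))
C = [0, 6, 7, 12, 18, 19]
print(boxed_unary_join([A, B, C], 3))
```

In every iteration the relation that determines `high` consumes its full budget (or reaches its end), which mirrors the argument used above to bound the number of iterations.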

Unary cross-products. Consider the cross-product of unary relations, with each relation larger than :

We again split the bulk of the available memory across the input relations. The boxing procedure is recursive where each dimension i of the recursion corresponds to a variable (See Fig. ?). The procedure starts with i. In general, at a dimension , we loop over the predicate via the probe-provisioning loop. Then, for each slice at dimension , we do the same recursively for the next higher dimension. At the bottom of the recursion—when we reached the , we call LFTJ on the created slices. Then, the slices provide data for the box , i.e., in which the variable can range from i to i. Note that (like above) we can run the original query over the slice data since the slices are guaranteed to not have data outside their range and thus the boxes partition the search-space without overlap.
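The recursion over dimensions can be illustrated with a short generator. This is a sketch under simplifying assumptions: relations are Python lists, each slice holds at most `budget` values, and `itertools.product` stands in for running LFTJ on the slices of one box; `boxed_cross_product` is a hypothetical name.

```python
from itertools import product

def boxed_cross_product(relations, budget):
    """Yield all tuples of the cross-product, one box of slices at a time."""
    def recurse(dim, slices):
        if dim == len(relations):
            # Bottom of the recursion: join over the in-memory slices of this box.
            yield from product(*slices)
            return
        rel = relations[dim]
        # Probe-provision loop for this dimension: at most `budget` values per slice.
        for start in range(0, len(rel), budget):
            yield from recurse(dim + 1, slices + [rel[start:start + budget]])
    return recurse(0, [])

out = list(boxed_cross_product([[1, 2, 3], [10, 20], [7, 8, 9, 10]], 2))
print(len(out))
```

Because the slices of one dimension do not overlap, every output tuple is produced in exactly one box, although the overall output order is no longer lexicographic.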

General joins. The general approach combines the two previous algorithms while also considering corner cases. Let be a general full-conjunctive join of atoms, and variable order with no atom containing the same variable twice, and all atoms in mentioning variables consistent with . We first group the atoms based on their first variable : we place all atoms that have as first variable into the array at position . To follow the exposition, consider the join

where we put , and into and into . Like for cross-products, we recursively provision for the dimension ranging from to . For each , we use the method for joining unary relations for the atoms in . In particular, for each we probe and create slices for at level 0 regardless of or the arity of . Thus, at dimension , we iteratively provision atoms with as their first attribute restricting the range of but not any of the other variables , . This ensures that we can freely choose any partitions we might perform on these variables for . Like with cross-products, we call LFTJ at the lowest level when .

The above works well unless any of the probes reports a SPILL, which can occur if a relation exhibits significant skew. For example, imagine there is a value for which exceeds the allocated storage. Then, at dimension , probing at level with a lower bound will return SPILL. We handle these situations by setting the upper bound at level to , and essentially marking as a relation that needs to be provisioned at the dimension of its second attribute (e.g., 3) alongside the atoms in . Note that a relation of arity can spill times in worst case.

The general algorithm is given in Algorithm ?. We evenly divide the available storage among the dimensions, and assign the atoms to accordingly (lines 3-4). We also use a variable to let lower dimensions utilize memory that was not fully used by higher dimensions. In line 11, we union the spills from the previous level to the atoms we need to provision. The method probe in line 12 probes atoms in atms to find an upper bound such that all atoms can be provisioned. Here, we evenly divide mem by the size of atms. The lower bounds for probing are taken from low, which is also used to determine the starting tuples for possible spills. The method sets the upper bound at the current dimension and fills the spills predicate if necessary. The method provision provisions the predicate with bounds from low and high adapted to the variables occurring in . It returns the slice and the size of used memory.

3.4 I/O Complexity of Boxing

We now analyze the Boxing approach to obtain complexity bounds on the number of block I/Os. Since we concentrate on full conjunctive queries, every output tuple is computed exactly once by LFTJ. As explained above, we use some constant-size buffer to let the I/O cost for the output be where is the output size. We now analyze the cost of the I/Os for reading input data.

For each dimension , , let be an upper bound on how often the repeat-until loop from lines 9–23 of Algorithm ? is executed for a single invocation of the surrounding procedure. is determined by how often we need to provision to completely iterate through the atoms in ii. In each step (except possibly the last) at least one of the input predicates loads tuples—this is the predicate that determines the high bound . In case there is no spill, this is immediately clear; but it is even true if a predicate is being spilled because its tuples are then “consumed” at a higher dimension. Note, that at the last dimension, no spills can occur. We thus have , where is the number of atoms in the join.

Let us now determine how often for each dimension is called. We denote this number by . The outermost is called once; is called once for each iteration of the repeat loop at level 1, that is times. In general, and consequently .

It is convenient to inject the following observation:

Proof The number of boxes equals the number of loop executions at dimension , which is bound by .

Back to the I/O costs. Consider only the I/O that is performed directly in a certain call without counting the cost in the recursive calls from line 19. First, we count provisioning only. Here, during the evaluation of the repeat loop (lines 9-23), we load the data in . Similarly as in the case of joins with one variable, we can cache the last blocks containing the last tuple of the provisioned TrieSlices, and thus load each block from the input exactly once. Consequently, the I/O work done to provision directly in each invocation of is limited by . The I/Os necessary for probing can be bound by since we probe at most relations once for each execution of the repeat loop. If we use the assumption that is larger than as explained in Section 2, we thus obtain as I/O cost directly at dimension for a single call. As last step, we multiply by to obtain the total I/O cost at dimension as . Since output is written once and we consider joins without projections we obtain:

Optimality. This complexity is optimal when only the number of variables is used to characterize the query. This is because the Cartesian product of relations can produce output which requires block writes.

Furthermore, in practice, the input is often only by a constant factor larger than the available memory:

This (better looking) bound is, obviously, optimal for queries that require reading the entire input.

No spills. If the execution does not produce any spills, we can strengthen the general result. To do so, we quickly need to introduce a property of queries:

Clearly, the rank of a query (for any key-order) is bound by the number of variables—but sometimes smaller. E.g., for the triangle query , but also is 2. Note that is the largest for which is non-empty when boxing with key-order .

Proof At dimensions , there are no I/O operations since both and are empty, obtaining the desired result by summing up for .

Spills occur in the boxed LFTJ execution if there is an input relation and value for which the Slice exceeds the size of the memory allocated for . We can thus characterize when they occur. For a query with variables and atoms: Let be the memory used for the body of the query. If we divide up all space evenly among all variables, and for each dimension, evenly among all predicates, then the critical value for any is approximately , since the slice for has a size of at most around .

3.5 CPU Data-Complexity of Boxing

The CPU work performed by a boxed on input data falls into two categories: (1) the work necessary to determine the number of boxes and to provision them, and (2) the work done by the in-memory executing over the boxes. For an input , the asymptotic work in category 2 is trivially bounded by the asymptotic work of the in-memory on multiplied by the number of boxes used, simply because each invocation uses input that is a subset of . For the work in category 1: deciding on the bounds of a single box is done in , copying its required data takes no more than , resulting in a total upper bound of for boxes.

Using Lemma ?, we can thus conclude:

4 LFTJ Applied to Triangle Listing

4.1 Boxed LFTJ-

From Corollary ?, we immediately get an I/O complexity of if . Without this assumption, plugging the triangle query into Theorems ? and ?, we obtain:

With no spills, boxed LFTJ thus matches the I/O complexity of MGT [10], which is optimal if as shown in [10]. From above, we know that spills only occur if there is a single node that has more than around neighbors. For 5GiB of allocated memory and 64-bit node ids, this amounts to an upper limit of 37 million neighbors per node, a number that is seldom reached in practice. Interestingly, the core MGT algorithm in [10] also requires that the node degree is limited; MGT achieves the bound without restrictions by deploying a pre-processing step.

For the compute complexity of boxed , we rely on Theorem ?, expecting to essentially maintain the performance of in-memory assuming .

4.2 In-Memory LFTJ- CPU Complexity

In this section, we use the convention that is always the input graph. While the previous section was specific to our version of LFTJ that uses TrieArrays, the results here apply to all LFTJ implementations as long as the basic TrieIterator operations adhere to the complexity bounds given in [36] and restated in Section ?. The following result follows with little work directly from [36] and [3]:

See Appendix B.2.

The rest of the section strengthens this result by analyzing the complexity of LFTJ on families of graphs that are characterized by the number of edges and their arboricity. The arboricity of an undirected graph is a standard measure for graphs, counting the minimum number of edge-disjoint forests that are needed to cover the graph. A classic result by Nash-Williams [19] links this number to the graph’s density by showing that no subgraph of has more than edges if and only if . In general, is in [7] for any graph . However, in many real-world graphs, is significantly smaller [7].

It turns out that the runtime-complexity for is related to the graph’s arboricity with behaving better the smaller is. It thus makes sense to consider ’s complexity for graphs characterized by an upper bound on their arboricity. For compatibility with the asymptotic complexity, we bound the graph’s arboricity with respect to their edge-size:

See Appendix B.3. We analyze the work done by the leapfrog joins at levels , , and . Only the third level is interesting, where we use a result by [7] that gives an upper bound of for the sum .
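The three levels of this analysis can be illustrated with a minimal in-memory sketch. It is not our LFTJ implementation: it orients each edge from the smaller to the larger node id and replaces the leapfrog join at the third level with a plain merge intersection. The second function computes the sum of the smaller endpoint degree over all edges, which is the quantity that the result from [7] bounds; as that result is usually stated, the sum is at most twice the arboricity times the number of edges. The names `count_triangles` and `sum_min_degree` are hypothetical, and the input is assumed to be a simple graph.

```python
def count_triangles(edges):
    """Count triangles x < y < z in a simple undirected graph (sketch)."""
    out = {}
    for a, b in edges:
        lo, hi = (a, b) if a < b else (b, a)   # orient edge: smaller -> larger
        out.setdefault(lo, set()).add(hi)
    out = {u: sorted(vs) for u, vs in out.items()}

    triangles = 0
    for x, ys in out.items():          # level 1: nodes with an outgoing edge
        for y in ys:                   # level 2: out-neighbors of x
            zs = out.get(y, [])
            i = j = 0                  # level 3: intersect N(x) with N(y)
            while i < len(ys) and j < len(zs):
                if ys[i] == zs[j]:
                    triangles += 1
                    i += 1
                    j += 1
                elif ys[i] < zs[j]:
                    i += 1
                else:
                    j += 1
    return triangles


def sum_min_degree(edges):
    """Sum over all edges of the smaller of the two endpoint degrees."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return sum(min(deg[a], deg[b]) for a, b in edges)


k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
k4_triangles = count_triangles(k4)   # 4 triangles in the complete graph K4
k4_sum = sum_min_degree(k4)          # 6 edges, every degree is 3, so 18
```

For K4 the arboricity is 2, so the bound of twice the arboricity times the number of edges evaluates to 24, which the computed sum of 18 respects.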

Clearly, if the maximum degree of our graphs is bounded, then their arboricity is in . Furthermore, the arboricity of planar graphs is also in [7], immediately leading to:

We can also amend the optimality result from Prop. ? showing that remains optimal (modulo log-factor) even when considering graphs with a limited arboricity:

See Appendix B.4. It turns out that for any such , we can construct large graphs that have triangles.

We highlight that the above theorem is quite general. It only requires the algorithm to be correct for input graphs of restricted arboricity. For example, even if we (somehow) knew that all our input graphs have an arboricity bounded by, say, , we could not design a specialized algorithm that only works on these graphs and has a runtime complexity of .

The optimality from Theorem ? unfortunately does not directly follow from the worst-case optimality of LFTJ for families of instances that are closed under renumbering (Thm 4.2 in [36]), because the optimality in [36] was obtained under the assumption that each relation symbol appears only once in the body of the join, a property used in the proof for Thm 4.2 of [36].


We have implemented a general-purpose join-processing system with LFTJ at its core. To highlight its generality, we briefly list its current features. We support multiple fixed-size primitive data types including int64, double, boolean, and a fixed-point decimal type. Predicates (stored as TrieArrays) can have variable arities, and we support marking a prefix of the attributes as key (the TrieArray then needs fewer index arrays). Predicates can be loaded from and stored into CSV files. Besides materialized predicates that store data, we have TrieIterator implementations for various “builtins” such as comparison operators and arithmetic operators. Using a simple command shell, joins such as the triangle query can be issued in a Datalog-like syntax. We require the written joins to have atoms with variables consistent with a global key order. At the head of rules, we support optional projections and some aggregations. The system uses secondary storage (via memory-mapped files) to allow processing of data that exceeds the physical memory, and it employs the boxing technique presented here. We have not implemented a query optimizer (to find good key orders), and we currently support neither mutating relations nor transactions. In the following, we highlight aspects of the system that likely have an impact on performance, yet whose detailed analysis and description go beyond the scope of this paper.

Removing interpretation overhead. Datalog queries that are issued are compiled to optimized machine-code and loaded as a shared library into the system. Our code still uses the TrieIterator interfaces but most code is templatized: predicates by their arity, key-length and types of the attribute; TrieIterators by their types and arity; the LFTJ by the key-order, TrieIterators of body atoms as well as each of their variables; a rule by the LFTJ for processing the body and the classes that perform so-called head-actions. Using this approach, we can still program with the convenient TrieIterator interfaces—yet allow the C++ compiler to potentially inline join processing all the way down to the binary searches using the appropriate comparison operators for the type at hand.

Misc Optimizations. We are also deploying a parallelization scheme for LFTJ to utilize multiple cores. In the boxed LFTJ version, boxes are worked on one after another, yet LFTJ utilizes available cores while processing a single box. We will also provide single-threaded performance when comparing with single-threaded competitors.

Even though dividing the available memory evenly across the dimensions is sufficient to obtain the asymptotic complexity bounds, using more memory at smaller dimensions reduces the number of boxes created. Note that as long as the memory used at each dimension is a constant fraction of the total memory, the complexity bounds remain intact. We picked a ratio of 4:1 for dividing up the memory between : in the triangle query. We also do not allocate budget to dimensions that do not have an atom using as first variable (e.g., ). This is fine since, in case there is a spill, the budget for the spilling relation will be moved over to the next dimension.

If there are two atoms referring to the same relation and having the same first variable, we naturally only provision and create one slice for them. For example in the triangle query, we probe and provision a single relation at dimension for the atom and the atom . Of course, in the case of spills they might get untangled at higher dimensions. We do not exploit the fact that the third atom refers to the same relation.

We envision that for some queries, an optimizer, aided by constraints provided by the user, can avoid provisioning certain boxes because it can infer that there cannot possibly be a query result within that box. For example, in our case, we know that . This can easily be inferred from the constraint for any . Based on this, we do not need to provision at dimension if the high bound for is smaller than the low bound for . We have put a hook into the boxing mechanism to bypass provisioning if after probing this condition is met. A detailed exploration of constraints and their interactions with probing and provisioning is beyond the scope of this work.
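The bypass condition can be sketched as a one-line check. This is a minimal illustration under stated assumptions: we read the constraint as requiring the first variable to be strictly smaller than the second, and the function name `can_skip_box` as well as the parameter roles `x_low` and `y_high` are hypothetical, not part of our system's interface.

```python
def can_skip_box(x_low, y_high):
    """Return True if a box cannot contain any binding with x < y.

    x_low:  inclusive low bound of the box along the first variable
    y_high: inclusive high bound of the box along the second variable
    (both parameter roles are assumptions made for this sketch)
    """
    # Every binding in the box has x >= x_low and y <= y_high, so
    # y_high < x_low implies y < x, which contradicts the constraint x < y.
    return y_high < x_low


skip = can_skip_box(x_low=10, y_high=5)   # True: all y values lie below all x values
keep = can_skip_box(x_low=3, y_high=8)    # False: e.g. x = 3, y = 8 is possible
```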

6 Experimental Evaluation

In our experimental evaluation, we focus on the triangle listing problem. Here, we investigate the following questions: (1) What is the CPU overhead introduced by boxing LFTJ? (2) How well does boxed LFTJ cope with limited available main memory, how does vanilla LFTJ do? (3) How does LFTJ compare to best-in-class competitors?

Evaluation environment. We use a desktop machine with an Intel i7-4771 CPU that has 4 cores (8 hyper-threads), each clocked at 3.5GHz. The machine has 32GB of physical memory and a single SSD disk. It is running Ubuntu 14.04.1 with a stock 3.13 Linux kernel.

Data. We use both real-world and synthetic input data of varying sizes. The data statistics are shown in Fig. ?. The smallest dataset we consider is “LJ”, which contains the friendship graph of the online blogging community LiveJournal [14]. Next, “Orkut” is the friendship graph of the free online community Orkut [18]. “TWITTER” is one of the largest freely available graph data sets. It contains the as-of-2010 “follower” relationships among 42 million Twitter users [12]. The dataset has 1.47 billion of these relations, which we interpret as undirected edges in a graph, resulting in 1.2 billion edges. This dataset contains almost 35 billion triangles. Unlike the first two data sets, which we obtained from [14], TWITTER was gathered from [39]. We also consider synthetically generated data due to its better understood characteristics. We focus on two datasets: “RAND” and “RMAT”. Each comes in a medium-sized version with 16 million nodes and 256 million edges and a large version with 80 million nodes and 1.28 billion edges. In the “RAND” dataset, we create edges by uniformly randomly selecting two endpoints from the graph’s nodes. The “RMAT” data contains graphs created by the Recursive Matrix approach as proposed by Chakrabarti et al. [5]. This approach creates graphs that closely match real-world graphs such as computer networks or web graphs. We used the data generator available at [41] with its default parameters. The LiveJournal and the synthetic graphs were also used by the MGT work in [10] and earlier work [8] to evaluate out-of-core performance for the triangle listing problem. All graphs have been made simple by removing self and duplicate edges. The CSV sizes in Fig. ? refer to the CSV data where each undirected edge is mentioned only once. TA stands for our TrieArray representation as described in the earlier sections. We use 64-bit integers per node identifier.
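The construction of a “RAND”-style graph, together with the step that makes a graph simple, can be sketched as follows. This is an illustrative sketch and not the generator used for our data sets: the function name `random_simple_graph`, its parameters, and the choice to drop (rather than redraw) rejected edges are assumptions, so the result may contain fewer edges than the number of draws.

```python
import random


def random_simple_graph(num_nodes, num_draws, seed=0):
    """Draw endpoints uniformly at random and return a simple edge list."""
    rng = random.Random(seed)
    edges = set()
    for _ in range(num_draws):
        a = rng.randrange(num_nodes)
        b = rng.randrange(num_nodes)
        if a == b:
            continue                                # remove self edges
        edges.add((a, b) if a < b else (b, a))      # the set removes duplicates
    # Each undirected edge is listed once, smaller endpoint first.
    return sorted(edges)


sample = random_simple_graph(num_nodes=100, num_draws=500, seed=1)
```

Storing each edge with the smaller endpoint first matches the convention that every undirected edge is mentioned only once.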

Methodology. We measure and present the time for running the algorithms on the mentioned data sets with various configurations and memory restrictions. We will run our TrieArray-based implementation of LFTJ with various configurations and two competing algorithms. Since all algorithms need to report the same number of triangles, we essentially run them in “counting-mode” and we thus do not account for the time nor the I/O it takes to output the triangles. This was also done in [10]. Input data for LFTJ is given in TrieArray format; we do not include the time it takes to create the TrieArray from CSV data (which can be done in at most two passes after sorting the data).

Boxed LFTJ Analysis. On the X-axis, we vary the total amount of memory available for boxing, shown in GB. The first row shows the total runtime in seconds without OS-level memory restrictions and with warm caches to evaluate the additional CPU work necessary for boxing. For performance in an out-of-core scenario, we enforce OS-level memory restrictions and have all caches cleared before execution in the second row. The third row shows the number of boxes and the amount of provisioned memory in multiples of the size of the input data. Omitted graphs for {RAND|RMAT}16 look like the “80” variants.

What CPU overhead does Boxing introduce? To measure the CPU overhead that is introduced by the boxing approach, we advise LFTJ to only use memory the size of a fraction of the input during execution; yet we do not place any limit on the caches the operating system keeps for file operations. To further (almost completely) remove I/O, we prefix the execution by cat-ting all input data to /dev/null, which essentially pre-loads the Linux file-system cache. We now consider the two questions (i) What is the CPU overhead for probing and copying? and (ii) What is the overhead introduced by running LFTJ on individual boxes in comparison to running LFTJ on the whole input data? To answer the first question, we simply run three variants: (a) the full LFTJ, (b) probing and copying data into TrieArraySlices without running LFTJ, and (c) only probing without copying input data or running LFTJ. Results are shown in the first row of Fig. ?. On the X-axis, we vary the space available for boxing. The individual points range from up to of the input data size in TrieArray representation. We chose to range up to 200% since the input is essentially read twice by : once for each of the dimensions and .

Results. Answering question (i): We can see that the CPU work performed for probing and copying is very low in comparison to the work done by the join evaluation, even when the box sizes are limited to as little as of the size of the input. Answering (ii), we look at the red lines for LFTJ and compare the curve with the value at the far right, as this one is achieved by using a single box. The real-world data sets behave as expected: starting at around 25%, they level out, demonstrating that the CPU overhead is low if the available memory is not too much smaller than the input data size. For the synthetic data sets, we see that, unexpectedly, using more boxes reduces the CPU work (memory range 10%–200%). We speculate that this is because the boxed version might reduce the work done in binary searches for seek since the space that needs to be searched is smaller. Only at 5% does this trend reverse: using more and smaller boxes then takes longer.

How well does Boxing do with limited memory? We are also interested in the performance of the boxing technique when disk I/O needs to be performed. Here, we run the same experiments as above but we clear all Linux system caches (see Appendix C.1) before we start a run. We further use Linux’s cgroup feature to limit the total amount of RAM used for the program (data+executable) and any caches used by the operating system to buffer I/O on behalf of the program. As the actual limit we use the value given to the boxing and shown on the X-axis plus a fixed 100MB (that accounts for the output buffer and the size of the executable). Results are shown in the second row of Fig. ?. We see that probing is still very cheap even for the 5% memory setting; provisioning the data now has noticeable costs for low-memory settings (25% and below). However, even then, it is mostly dominated by the time to actually perform the in-memory joins. This is even more so for the real-world data sets. Overall, with around 25% or more memory, boxed LFTJ’s performance stays constant, indicating that I/O is not the bottleneck. For example, we can count all 35 billion triangles in the TWITTER dataset in around 29 minutes without I/O and only need up to 35 minutes with disk I/O.

In the third row of Fig. ?, we show the number of boxes used as well as the total amount of memory copied for provisioning as a multiple of the TrieArray input size from Fig. ?. We see that the number of boxes is generally below 100 unless the memory is restricted to below 25%; similarly, we never copy more than 15x of the input data even for a 5% memory restriction. An example for how the boxes were chosen for the TWITTER data set is shown in Fig. ?. Each figure shows the front (x-y) plane of the 3-D input space. Darker pixels stand for more data in the represented area. We see that boxes become smaller around the more data-dense areas. See Appendix D for more details.

Last, we are interested in how the boxed LFTJ compares to a variant without our extension. Since LFTJ as presented in [36] is a family of algorithms that needs to be parameterized by how data is physically stored and how the TrieIterator operations are implemented, answering this question is hard: conclusions for one specific implementation of the data back-end might not hold for another. In particular, our approach of storing data in huge arrays and performing mostly binary searches over them might be particularly bad from an I/O perspective. However, with these considerations in mind, we also ran our version of LFTJ with the cgroup memory restrictions and a provisioning mode that does not copy the data but leaves it in memory-mapped files. The data is thus paged in (from the input file) by the Linux virtual memory system using a standard replacement strategy. Results for this experiment are shown in Fig. ?: On average, vanilla LFTJ takes 65x, 30x, and 20x as long as boxed LFTJ at the memory levels of 10%, 25%, and 35%, respectively.

Performance of Graphlab vs. boxed LFTJ in single- and multi-threaded configurations. No resource limitations. The Y-axis shows runtime in seconds.

How does boxed LFTJ compare to specialized best-in-class competitors for triangle listing? We compare to (1) the triangle counting algorithm presented in Schank’s dissertation [32], which has been implemented for the graph analysis framework Graphlab [16]. We chose this algorithm as our in-memory competitor since it supports multiple threads and was used in other comparisons [37] before. We also (2) compare to the MGT algorithm by Hu, Tao, and Chung [10] as the (to the best of our knowledge) currently best triangle listing algorithm in the out-of-core setting. Our results are shown in Fig. ? and Fig. ?. The boxed LFTJ is on average 65% slower than Graphlab, both when run in single-threaded mode and in multi-threaded mode with 8 threads. Graphlab, being optimized for an in-memory setting with optional distribution, was not able to run any of our large data sets, getting “stuck” once all of the 32GB of main memory and 32GB of swap space had been consumed.

Comparing to MGT (cf. Fig. ?): We used the cgroup-memory restrictions and cleaned caches for running MGT and boxed-LFTJ. When we run in single-threaded mode, then MGT outperforms boxed LFTJ by a factor of , , and in the configurations with 10%, 25%, and 35% of the memory, respectively. Due to time constraints, we did not run LFTJ in single-threaded mode for other configurations. When we allow LFTJ to utilize all of the 4 available cores, we are on average 47%, 22%, and 28%, respectively, faster than the single-threaded MGT. We have not investigated how well MGT parallelizes. Note that MGT internally uses only 32 bits as node identifiers (vs. our 64bit identifiers). Nevertheless, we used the same values to configure and limit the amount of memory for both MGT and LFTJ.

7 Related Work

Related work spans multiple areas at different levels of generality. From the most general to the most specific:

The SociaLite effort [33] at Stanford also proposes to use systems based on relational joins (in this case Datalog) for graph analysis. They show that declarative methods not only allow for more succinct programs but are also competitive with, and sometimes outperform, typical other implementations. We did not compare our join performance with the SociaLite system as it is clearly more feature-rich; it is also Java-based, which might or might not influence performance in ways orthogonal to our investigation. We note that the benchmarks presented in [33] and [34], which evaluate counting triangles among other queries, did not use datasets as large as ours.

A worst-case optimal join algorithm was first presented by Ngo et al. in [21], following the AGM bound [3] that bounds the maximum number of tuples that can be produced by a conjunctive join. Leapfrog Triejoin by Veldhuizen [36], the join algorithm we are using, has been shown to be worst-case optimal as well (modulo a log-factor). In fact, [36] showed that Leapfrog Triejoin is worst-case optimal (modulo log-factor) for more fine-grained families of inputs. Our work, especially on the worst-case optimality for graphs with limited arboricity, was inspired by the worst-case optimality results in [36]. A good survey and description of this class of worst-case optimal join algorithms is [22], where the authors describe not only the AGM bound and its application, but also the original NPRR algorithm and LFTJ.

Most recently, Khamis, Ngo, Re, and Rudra proposed so-called beyond-worst-case-optimal join algorithms. Here, the performed work is not measured against a worst case within a fixed family of inputs; instead, it must be proportional to the size of a shortest proof of the result’s correctness. The idea was proposed by Ngo, Nguyen, Re and Rudra in [20]. Furthermore, [11] combines ideas from geometry and resolution, transforming the algorithmic problem of computing joins into a geometric one. Following this line of research is very interesting as it might offer even better performance in practice.

Our boxing approach is most closely related to the classic block-nested loop join (BNLJ)[29]. An interesting avenue for future work would be to investigate how optimizations and results for the BNLJ transfer to the multi-predicate LFTJ.

Listing triangles in graphs is a well-researched area in computer science. For the in-memory context, see [13] for a recent survey. Triangle listing can also be reduced to matrix multiplication. Recent work that proposes new algorithms based on this approach is [4]. Chiba and Nishizeki [7] propose an in-memory triangle listing algorithm that runs in matching the best possible bound we give in Section 4. To the best of our knowledge, our insight that this is the best possible theoretical bound for this class of graphs, is novel and thus provides new insights about these algorithms. Earlier, [26] already showed that enumerating all triangles in planar graphs is a linear-time problem.

Triangle listing in the out-of-core context: Following up on the MGT work [10], Pagh and Silvestri investigate the I/O complexity of triangle listing [23]. They improve on the complexity of MGT from to an expected . They also give lower bounds and show that their algorithm is worst-case optimal by proving that any algorithm that enumerates triangles needs to use at least I/Os. They also give a deterministic algorithm using a color coding technique. Investigating whether the techniques used could be generalized to general joins is a very interesting avenue for future work. Prior to [10], [17] proposed an algorithm with an I/O complexity of ; furthermore, [9] proposed an algorithm with an I/O complexity of . Cheng et al. [6] study the general problem of finding maximal cliques. We did not benchmark against these algorithms since MGT dominated them by an order of magnitude.

Research has also been done on distributing triangle counting and other graph algorithms [1], and on approaches that use the MapReduce framework [27].


For the well-studied problem of triangle listing, we have investigated how a general-purpose and worst-case optimal join algorithm compares against specialized approaches in the out-of-core context. By using Leapfrog Triejoin, we were able to devise a strategy that not only allows for good theoretical bounds in terms of I/O and CPU costs but also demonstrates very good performance: For very large input graphs of 1.2 billion edges and more, LFTJ counts triangles with a speed of 4 million input edges per second for uniformly random data, and performs a complete count of the 35 billion triangles in the TWITTER dataset in little over 25 minutes on a standard 4-core desktop machine while limiting the available main memory to around 5GB. Our positive results can be interpreted as a confirmation of the database community’s theme of creating systems to empower (domain-expert) users via declarative query interfaces while providing very good performance.

Acknowledgements. We thank Ken Ross and Haicheng Wu for comments on an earlier draft; and we thank the anonymous reviewers for their comments.

A Leapfrog Join and Leapfrog Triejoin

A.1 TrieIterator Example

A TrieIterator is initialized to the root node . Methods for vertical navigation are: for moving “down” to the first child of the current node and for moving “up” to the parent of the current node. Horizontally, movement is restricted to direct siblings, which are accessed via the LinearIterator interface that comprises the methods , , , and . It is convenient to think of the children of a node as being stored in increasing order in an array of size . The method returns true if the iterator is positioned after the last element (e.g., at position ). The method requests to move to the next element; will be true if the iterator was at the last position already (e.g., calling at position ). The method can be used to forward-position the iterator to the element with value ; if is not in , then the iterator is placed at the element with the smallest value , or if no such exists. Finally, data is accessed at the granularity of a single domain element via the method , which returns the element at the current position. The methods , , , and may only be called if is false; furthermore, the value given to must be at least ; and must not be called at the root node .
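The horizontal part of this interface can be sketched over a single sorted array. The following is a minimal illustration, not our TrieArray implementation: the class name `ArrayLinearIterator` and the method names `at_end`, `key`, `next`, and `seek` are assumptions chosen for this example, and the vertical navigation methods are omitted.

```python
from bisect import bisect_left


class ArrayLinearIterator:
    """Linear iterator over the sorted, distinct children of one trie node."""

    def __init__(self, keys):
        self.keys = keys   # must be sorted in increasing order
        self.pos = 0       # current position; len(keys) means "after the end"

    def at_end(self):
        # True once the iterator is positioned after the last element.
        return self.pos >= len(self.keys)

    def key(self):
        # Element at the current position; only valid while not at the end.
        return self.keys[self.pos]

    def next(self):
        # Move to the next element; only valid while not at the end.
        self.pos += 1

    def seek(self, target):
        # Forward-position at the smallest element >= target using a binary
        # search over the remaining suffix; requires target >= key().
        self.pos = bisect_left(self.keys, target, self.pos)


it = ArrayLinearIterator([2, 5, 7, 24])
it.seek(6)
found = it.key()            # 7 is the smallest element >= 6
it.seek(25)
exhausted = it.at_end()     # True, since no element >= 25 exists
```

Restricting the binary search to the suffix starting at the current position reflects the requirement that the iterator only moves forward.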

A.2 Leapfrog TrieJoin Procedure

LFTJ is given a join description as a Datalog rule body with atoms and variables. For each of the atoms, a single TrieIterator is created. Furthermore, LFTJ maintains an array of Leapfrog joins, one join for each variable. The LFJ for variable uses pointers to the TrieIterators for atoms that mention the variable . Overall, LFTJ is implemented as a TrieIterator itself (see Algorithm ?). A variable remembers at which level of the output trie the iterator is positioned. The vertical navigation methods manipulate , open and close the appropriate TrieIterators, and initialize the Leapfrog joins. The linear iterator methods are then simply delegated to the LFJ, which computes the appropriate intersections.
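The intersection computed by a single Leapfrog join can be sketched for plain sorted lists. This is a simplified illustration in the spirit of the leapfrog procedure from [36], not the code of our system: the function name `leapfrog_intersect` is hypothetical, positions in Python lists stand in for iterators, and each list is assumed to be sorted and free of duplicates.

```python
from bisect import bisect_left


def leapfrog_intersect(lists):
    """Intersect sorted lists of distinct integers by leapfrogging (sketch)."""
    if not lists or any(len(lst) == 0 for lst in lists):
        return []
    k = len(lists)
    pos = [0] * k
    # Visit the lists in order of their first key, so that the list visited
    # next always holds the smallest current key.
    order = sorted(range(k), key=lambda i: lists[i][0])
    p = 0
    max_key = lists[order[-1]][0]       # largest current key
    out = []
    while True:
        i = order[p]
        cur = lists[i][pos[i]]
        if cur == max_key:
            # Smallest and largest current keys agree: all lists contain cur.
            out.append(cur)
            pos[i] += 1
        else:
            # Leap over all keys that are smaller than the largest current key.
            pos[i] = bisect_left(lists[i], max_key, pos[i])
        if pos[i] >= len(lists[i]):
            return out                  # one list is exhausted
        max_key = lists[i][pos[i]]
        p = (p + 1) % k


common = leapfrog_intersect([[1, 3, 4, 7, 9], [3, 4, 5, 9, 11], [0, 3, 9, 10]])
```

In this example the three lists share exactly the values 3 and 9.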


B.1 Proof for Prop.

Consult Fig. ?. The variable assignments for , , and as well as the corresponding neighbors and are shown. Each node in causes two block s. Further, the block storing the node with id 24 of will be evicted when and or earlier, and the last block with is thus repeatedly loaded when , , and .

Detailed proof sketch for the general case: The outer loop in line 1 of Algorithm ? ranges from to . For each value , we then join ’s neighbors with (line 2) to obtain bindings for . Since each node has exactly one neighbor , this essentially performs a lookup of in the first column of . Now, since we spaced the second values in a distance of apart, locating each within incurs at least one I/O. Also, since the second values in repeat in groups of size , the blocks needed for the second group will have been evicted from memory before they are needed, resulting in a single I/O for each tuple in . The last step is to intersect the neighbors of with the neighbors of . In our TrieArray representation, this will incur another I/O.⁹

B.2 Proof for Proposition

The bound on the runtime can easily be obtained from the worst-case optimality of LFTJ with respect to input sizes (Corollary 4.3 in [36]) and the fractional edge-cover bound [3]: For any three binary relations , , the result size of the join is limited according to the fractional edge cover [3]. If the sizes of , , and agree, then is at most with ; adding the log-factor, we obtain the desired bound of .

The complexity is optimal modulo the log-factor since a graph with edges can have triangles.
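The lower bound on the number of triangles can be checked numerically on complete graphs, which are the standard witness for it. The sketch below is illustrative only; the function name `clique_triangle_ratio` and the symbols `n` and `m` are our own labels. It uses the facts that a complete graph on n nodes has C(n, 2) edges and C(n, 3) triangles, so the ratio of triangles to m^(3/2) approaches the constant sqrt(2)/3.

```python
from math import comb


def clique_triangle_ratio(n):
    """Return (edges, triangles, triangles / edges**1.5) for the clique on n nodes."""
    m = comb(n, 2)          # number of edges
    triangles = comb(n, 3)  # number of triangles
    return m, triangles, triangles / m ** 1.5


for n in (10, 100, 1000):
    m, t, ratio = clique_triangle_ratio(n)
    print(f"n={n:5d}  m={m:7d}  triangles={t:10d}  ratio={ratio:.4f}")
```

The ratio stays bounded away from zero as n grows, so the triangle count of these graphs grows proportionally to m^(3/2).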

B.3 Proof for Theorem

Let be as required. We now analyze the work done by on a graph with its directed version (possibly obtained via preprocessing). Let be all nodes in that have an outgoing edge, as usual. It is useful to also consult Fig. ? for an explanation of which Leapfrog joins are executed during . We now count the steps at each variable:

  • At level : We Leapfrog-join with itself yielding a bound of based on the requirements for the TrieIterator operations (see Section ?).

  • At level : for each , a leapfrog-join is performed between and . As usual, are the followers of , i.e., . Summing up all costs and using that the runtime of a leapfrog-join between two relations of size and , respectively, is bounded by , we obtain:

  • At level : Here, for (at most) each edge in we leapfrog join the neighbors of with the neighbors of . We thus incur the work:

    In their Lemma 2, Chiba and Nishizeki [7] observed that for any graph , the sum is bounded by . Since and because is monotonically increasing, we can bound the work by , finishing the proof.
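The arboricity-based bound used in the last step can be illustrated on small graphs. Lemma 2 of Chiba and Nishizeki [7] states that the sum over all edges (u, v) of min(d(u), d(v)) is at most 2·a(G)·m, where a(G) is the arboricity and m the number of edges. The Python sketch below only evaluates the left-hand side; the function name `min_degree_sum` is hypothetical, and the arboricity values in the comments (3 for the complete graph on 5 nodes, 1 for a star) are supplied by hand rather than computed.

```python
from collections import Counter
from itertools import combinations


def min_degree_sum(edges):
    """Sum of min(deg(u), deg(v)) over all undirected edges (u, v)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(min(deg[u], deg[v]) for u, v in edges)


# Complete graph on 5 nodes: 10 edges, arboricity 3, bound 2 * 3 * 10 = 60.
k5 = list(combinations(range(5), 2))
# Star with 8 leaves: 8 edges, arboricity 1, bound 2 * 1 * 8 = 16.
star = [(0, leaf) for leaf in range(1, 9)]

print(min_degree_sum(k5), min_degree_sum(star))
```

The star shows why the minimum matters: the hub has degree 8, yet every edge contributes only 1 to the sum.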

B.4 Proof for Theorem

We first show:

Informal overview of the technique. To get many triangles, we use fully connected graphs ; to stay under the arboricity limit, we choose appropriately; to get many edges, we take the union of many of these graphs and then fill up with singleton edges. The math works out to the above quantity.

Formal proof. Let be as required. Fix an . Let . Note that the fully connected graphs with nodes have edges. We construct a graph by packing as many as we can fit into our “-edges budget” and filling up the rest with unconnected edges: Let , let be the graph composed of instances of and single edges not connected to anything else. To complete the proof, we show: (1) The arboricity of is , and (2) has at least triangles.

Showing (1). The classic Nash-Williams result [19] states that for any graph , its arboricity is characterized by the maximum edge-node ratio among all its subgraphs:

It can easily be verified that choosing a as subgraph maximizes the ratio. Thus, .

Showing (2). As shorthand, let , and let , which is the largest integer multiple of that is not larger than . Each has triangles, and we have of them, for a total of

triangles as required.
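The packing construction from the proof can be sketched in Python as follows. This is an illustration under assumptions: the function name `pack_cliques` and the parameters `m` (edge budget) and `k` (clique size, at least 2) are hypothetical, and the specific choice of clique size made in the proof is not reproduced here; the caller supplies it.

```python
from math import comb


def pack_cliques(m, k):
    """Build a graph with exactly m edges from disjoint k-cliques plus filler edges.

    Returns (edges, triangle_count). Assumes k >= 2.
    """
    per_clique = comb(k, 2)       # edges in one clique on k nodes
    copies = m // per_clique      # as many cliques as fit into the budget
    edges = []
    next_node = 0
    for _ in range(copies):
        nodes = range(next_node, next_node + k)
        edges.extend((u, v) for u in nodes for v in nodes if u < v)
        next_node += k
    while len(edges) < m:         # fill up with edges not connected to anything else
        edges.append((next_node, next_node + 1))
        next_node += 2
    # Only the cliques contribute triangles: C(k, 3) each.
    return edges, copies * comb(k, 3)
```

With a budget of 20 edges and cliques of size 4, three cliques fit (18 edges), two filler edges are added, and the graph has 3 · 4 = 12 triangles.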

We prove Theorem ? indirectly. Let be an arbitrary, monotonically increasing, computable function, not identical to 1, that is in . Furthermore, let be an algorithm that lists all triangles in graphs with in time. Let be the maximal number of steps performs on any graph with and .

Since runs in time: choose and let be such that for all we have:

From : choose and let be such that for all we have .

Now, let be a large enough number such that (1) , (2) , (3) , and (4) . We can satisfy all conditions since maps into , is monotonically increasing, and is not identical to 1. We apply Lemma ? with our for , and conclude there is a graph with edges and arboricity at most with at least triangles. Clearly, needs to take at least steps on . Thus:

C Systems Aspects

C.1 Caches and Limiting Resident Memory

To clear Linux file caches we used as root:

sync && echo 3 > /proc/sys/vm/drop_caches

We restricted the memory that a process uses for any reason (data, heap, program, caches, etc.) using Linux cgroups. Inspecting the process later via top confirms that only the allowed resident memory is used. As root, we used commands such as:

# create a group
mkdir -p /sys/fs/cgroup/memory/limit_mem
# add process to group
   > /sys/fs/cgroup/memory/limit_mem/tasks
# limit memory
   > /sys/fs/cgroup/memory/limit_mem/

D More details for Fig. 8

The input space for is 3-dimensional. We box for = and = . Since there were no spills, intervals for dimension are always . The figures show how these boxes are created by projecting the 3-D input space onto the x-y plane. Darker pixels indicate areas where there is more data. In particular, the image was created as follows: For of the directed graph for the Twitter dataset, which can be viewed as a point set in 2D space, create a 2D histogram with 150x150 bins. Then, because we slice along the first dimension and collect the nodes plus their neighbors, we aggregate over ’s second-dimension values (e.g., ) to obtain a 1D histogram showing the total number of neighbors that the nodes in a certain bin have. We then spread this 1D histogram into a 2D space by setting the value at position to . This “image” is indicative of the total amount of data for a rectangular box. As a last step, we equalize the histogram and map it to greyscale to obtain a prettier picture. The red boxes are then drawn on top according to the provisioning decisions made during the boxing procedure. In the picture, the -axis goes from bottom-left to bottom-right and the axis from bottom-left to top-left, the same way as in Fig. ?. Note that the number of columns corresponds to how often we need to load the input data at level .
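The first two steps of the image construction (the 2D histogram and its aggregation into a 1D histogram) can be sketched in plain Python. This is a rough illustration, not the script used for the figure: the function name `neighbor_histogram`, the parameters `num_nodes` and `bins`, and the assumption that node ids are integers in the range 0 to `num_nodes` - 1 are all ours. The spreading, equalization, and greyscale steps are left out of the sketch.

```python
def neighbor_histogram(edges, num_nodes, bins):
    """Bin directed edges (u, v) into a bins x bins grid and sum each row.

    Returns (hist2d, hist1d), where hist1d[i] is the total number of
    neighbors of all source nodes that fall into bin i.
    """
    hist2d = [[0] * bins for _ in range(bins)]
    for u, v in edges:
        i = u * bins // num_nodes   # bin of the source node (first dimension)
        j = v * bins // num_nodes   # bin of the target node (second dimension)
        hist2d[i][j] += 1
    # Aggregate over the second dimension.
    hist1d = [sum(row) for row in hist2d]
    return hist2d, hist1d
```

For the figure, `bins` would be 150; a tiny call such as `neighbor_histogram([(0, 1), (0, 3), (2, 3), (3, 0)], 4, 2)` places one edge in each of the four cells.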


  1. Except when the in-memory ’s complexity is in , in which case the boxed version’s complexity is .
  2. also called prefix tree, radix tree or digital tree
  3. e.g., regular B+-Trees, sorted list of tuples, or the TrieArrays we describe later
  4. Note that with a simple caching strategy for the provision step (always cache the block containing and reuse in the next provisioning if possible), we could make the argument that each block is read at most once by the provisioning step obtaining the same asymptotic bound.
  5. Except for the corner case where the arboricity is bounded by 1, in which case the graphs have no triangles and an algorithm trivially exists.
  6. We also experimented with this so-called lazy provisioning for boxed LFTJ: here, lazy and eager show about the same performance; we omitted the data for space reasons.
  7. which we did not evaluate
  8. The actual join results are collected by walking the Trie.
  9. When storing relations in B-Trees or as an array of lexicographically sorted tuples the single neighbor of might already be available once has been loaded. However, even the reduced I/O cost of at least demonstrates thrashing.


  1. Enumerating subgraph instances using map-reduce.
    F. N. Afrati, D. Fotakis, and J. D. Ullman. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 62–73. IEEE, 2013.
  2. Optimizing joins in a map-reduce environment.
    F. N. Afrati and J. D. Ullman. In Proceedings of the 13th International Conference on Extending Database Technology, pages 99–110. ACM, 2010.
  3. Size bounds and query plans for relational joins.
    A. Atserias, M. Grohe, and D. Marx. In Foundations of Computer Science, 2008. FOCS’08. IEEE 49th Annual IEEE Symposium on, pages 739–748. IEEE, 2008.
  4. Listing triangles.
    A. Björklund, R. Pagh, V. Williams, and U. Zwick. In J. Esparza, P. Fraigniaud, T. Husfeldt, and E. Koutsoupias, editors, Automata, Languages, and Programming, volume 8572 of Lecture Notes in Computer Science, pages 223–234. Springer Berlin Heidelberg, 2014.
  5. R-MAT: A recursive model for graph mining.
    D. Chakrabarti, Y. Zhan, and C. Faloutsos. In SDM, volume 4, pages 442–446. SIAM, 2004.
  6. Finding maximal cliques in massive networks.
    J. Cheng, Y. Ke, A. W.-C. Fu, J. X. Yu, and L. Zhu. ACM Transactions on Database Systems (TODS), 36(4):21, 2011.
  7. Arboricity and subgraph listing algorithms.
    N. Chiba and T. Nishizeki. SIAM Journal on Computing, 14(1):210–223, 1985.
  8. Triangle listing in massive networks and its applications.
    S. Chu and J. Cheng. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 672–680. ACM, 2011.
  9. Algorithm engineering for large data sets.
    R. Dementiev. PhD thesis, Saarland University, 2006.
  10. Massive graph triangulation.
    X. Hu, Y. Tao, and C.-W. Chung. In Proceedings of the 2013 international conference on Management of data, pages 325–336. ACM, 2013.
  11. A resolution-based framework for joins: Worst-case and beyond.
    M. A. Khamis, H. Q. Ngo, C. Ré, and A. Rudra. CoRR, abs/1404.0703, 2014.
  12. What is Twitter, a social network or a news media?
    H. Kwak, C. Lee, H. Park, and S. Moon. In WWW ’10: Proceedings of the 19th international conference on World wide web, pages 591–600, New York, NY, USA, 2010. ACM.
  13. Main-memory triangle computations for very large (sparse (power-law)) graphs.
    M. Latapy. Theoretical Computer Science, 407(1):458–473, 2008.
  14. SNAP Datasets: Stanford large network dataset collection.
    J. Leskovec and A. Krevl. June 2014.
  15. Arboricity, h-index, and dynamic algorithms.
    M. C. Lin, F. J. Soulignac, and J. L. Szwarcfiter. Theoretical Computer Science, 426:75–90, 2012.
  16. Distributed graphlab: A framework for machine learning and data mining in the cloud.
    Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Proc. VLDB Endow., 5(8):716–727, Apr. 2012.
  17. An external memory algorithm for listing triangles.
    B. Menegola. Technical report, Universidade Federal do Rio Grande do Sul, 2010.
  18. Measurement and Analysis of Online Social Networks.
    A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. In Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC’07), San Diego, CA, October 2007.
  19. Decomposition of finite graphs into forests.
    C. S. A. Nash-Williams. Journal of the London Mathematical Society, s1-39(1):12, 1964.
  20. Beyond worst-case analysis for joins with minesweeper.
    H. Q. Ngo, D. T. Nguyen, C. Ré, and A. Rudra. CoRR, abs/1302.0914, 2014.
  21. Worst-case optimal join algorithms:[extended abstract].
    H. Q. Ngo, E. Porat, C. Ré, and A. Rudra. In Proceedings of the 31st symposium on Principles of Database Systems, pages 37–48. ACM, 2012.
  22. Skew strikes back: New developments in the theory of join algorithms.
    H. Q. Ngo, C. Ré, and A. Rudra. ACM SIGMOD Record, 42(4):5–16, 2014.
  23. The input/output complexity of triangle enumeration.
    R. Pagh and F. Silvestri. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’14, Snowbird, UT, USA, June 22-27, 2014, pages 224–233, 2014.
  24. Colorful triangle counting and a mapreduce implementation.
    R. Pagh and C. E. Tsourakakis. Inf. Process. Lett., 112(7):277–281, 2012.
  25. Uncovering the overlapping community structure of complex networks in nature and society.
    G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Nature, 435(7043):814–818, 2005.
  26. The clique problem for planar graphs.
    C. H. Papadimitriou and M. Yannakakis. Information Processing Letters, 13(4–5):131–133, 1981.
  27. Mapreduce triangle enumeration with guarantees.
    H.-M. Park, F. Silvestri, U. Kang, and R. Pagh. Proc. 23rd CIKM, 2014.
  28. Scalable big graph processing in mapreduce.
    L. Qin, J. X. Yu, L. Chang, H. Cheng, C. Zhang, and X. Lin. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 827–838. ACM, 2014.
  29. Database management systems, volume 3.
    R. Ramakrishnan and J. Gehrke. McGraw-Hill New York, 2003.
  30. Rascal: Calculation of graph similarity using maximum common edge subgraphs.
    J. W. Raymond, E. J. Gardiner, and P. Willett. The Computer Journal, 45(6):631–644, 2002.
  31. Clip: similarity searching of 3d databases using clique detection.
    N. Rhodes, P. Willett, A. Calvet, J. B. Dunbar, and C. Humblet. Journal of chemical information and computer sciences, 43(2):443–448, 2003.
  32. Algorithmic aspects of triangle-based network analysis.
    T. Schank. PhD thesis, University Karlsruhe, 2007.
  33. Socialite: Datalog extensions for efficient social network analysis.
    J. Seo, S. Guo, and M. S. Lam. In 29th International Conference on Data Engineering (ICDE), pages 278–289. IEEE, 2013.
  34. Distributed socialite: a datalog-based language for large-scale graph analysis.
    J. Seo, J. Park, J. Shin, and M. S. Lam. Proceedings of the VLDB Endowment, 6(14):1906–1917, 2013.
  35. Counting triangles and the curse of the last reducer.
    S. Suri and S. Vassilvitskii. In Proceedings of the 20th international conference on World wide web, pages 607–614. ACM, 2011.
  36. Triejoin: A simple, worst-case optimal join algorithm.
    T. L. Veldhuizen. In Proc. 17th International Conference on Database Theory (ICDT), Athens, Greece, March 24-28, 2014., pages 96–106, 2014.
  37. Multipredicate join algorithms for accelerating relational graph processing on GPUs.
    H. Wu, D. Zinn, M. Aref, and S. Yalamanchili. Fifth International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS 2014), September 2014.
  38. Defining and evaluating network communities based on ground-truth.
    J. Yang and J. Leskovec. CoRR, abs/1205.6233, 2012.
  39. 2010 Twitter data set. Downloaded Oct 2014 from
  41. R-MAT Data generator. madduri/software/GTgraph/.