I/O efficient bisimulation partitioning on very large directed acyclic graphs
Abstract
In this paper we introduce the first efficient external-memory algorithm to compute the bisimilarity equivalence classes of a directed acyclic graph (DAG). DAGs are commonly used to model data in a wide variety of practical applications, ranging from XML documents and data provenance models, to web taxonomies and scientific workflows. In the study of efficient reasoning over massive graphs, the notion of node bisimilarity plays a central role. For example, grouping together bisimilar nodes in an XML data set is the first step in many sophisticated approaches to building indexing data structures for efficient XPath query evaluation. To date, however, only internal-memory bisimulation algorithms have been investigated. As the size of real-world DAG data sets often exceeds available main memory, storage in external memory becomes necessary. Hence, there is a practical need for an efficient approach to computing bisimulation in external memory.
Our general algorithm has a worst-case IO-complexity of $O(\mathrm{Sort}(|N| + |E|))$, where $|N|$ and $|E|$ are the numbers of nodes and edges, resp., in the data graph and $\mathrm{Sort}(n)$ is the number of accesses to external memory needed to sort an input of size $n$. We also study specializations of this algorithm to common variations of bisimulation for tree-structured XML data sets. We empirically verify efficient performance of the algorithms on graphs and XML documents having billions of nodes and edges, and find that the algorithms can process such graphs efficiently even when very limited internal memory is available. The proposed algorithms are simple enough for practical implementation and use, and open the door for further study of external-memory bisimulation algorithms. To this end, the full open-source C++ implementation has been made freely available.
Jelle Hellings 
Hasselt University 
Belgium 
exbisim@jhellings.nl 
George H.L. Fletcher 
Eindhoven University of Technology 
The Netherlands 
g.h.l.fletcher@tue.nl 
Herman Haverkort 
Eindhoven University of Technology 
The Netherlands 
cs.herman@haverkort.net 
Data modeled as directed acyclic graphs (DAGs) arise in a diversity of practical applications such as biological and biomedical ontologies [?], web folksonomies [?], scientific workflows [?], semantic web schemas [?], business process modeling [?, ?], data provenance modeling [?, ?], and the widely adopted XML standard [?]. It is anticipated that the variety, uses, and quantity of DAG-structured data sets will only continue to grow in the future.
In each of these application areas, efficient searching and querying on the data is a basic challenge. In reasoning over massive data sets, typically index data structures are computed and maintained to accelerate processing. These indexes are essentially a reduction or summary of the underlying data. Efficiency is achieved by performing reasoning over this reduction to the extent possible, rather than directly over the original data.
Many approaches to indexing have been investigated in preceding decades. Reductions of data sets typically group together data elements based on their shared values or substructures in the data. In graphs, the notion of bisimulation equivalence of nodes has proven to be an effective means for indexing (e.g., [?, ?, ?, ?, ?, ?, ?]). Bisimulation, which is a fundamental notion arising in a surprising range of contexts [?], is based on the structural similarity of subgraphs. Intuitively, two nodes are bisimilar to each other if they cannot be distinguished from each other by the sequences of node labels that may appear on the paths that start from these nodes, as well as from each of the nodes on those paths. Grouping bisimilar nodes is known as bisimulation partitioning. Blocks of bisimilar nodes are then used as the basis for constructing indexing data structures supporting efficient search and querying over the data.
Efficient internal-memory solutions for computing bisimulation partitions have been investigated (e.g., [?, ?, ?]). To scale to real-world data sets such as those discussed above, it becomes necessary to consider DAGs resident in external memory. In considering algorithms for such data, the primary concern is to minimize disk IO operations, because reads and writes to disk are costly relative to main-memory operations.
Due to the random-access nature of internal-memory algorithms, the design of external-memory algorithms which minimize disk IO typically requires a significant departure from approaches taken for internal-memory solutions [?]. In particular, state-of-the-art internal-memory bisimulation algorithms cannot be directly adapted to IO-efficient external-memory algorithms due to their inherent random-access behaviour. While a study has been made on storing and querying bisimulation partitions on disk [?], to our knowledge no approach has been developed to date for efficiently computing bisimulation partitions in external memory.
Motivated by these observations, in this paper we give the first IO-efficient external-memory bisimulation algorithm for DAGs. Our algorithm has a worst-case IO-complexity of $O(\mathrm{Sort}(|N| + |E|))$, where $|N|$ and $|E|$ are the numbers of nodes and edges, resp., in the data graph and $\mathrm{Sort}(n)$ is the number of accesses to external memory needed to sort an input of size $n$. Efficiency is achieved by intelligent organization of the graph on disk, and by sophisticated processing of the graph using global and local reorganization and careful staging and use of local bisimulation information. We establish the theoretical efficiency of the algorithm, and demonstrate its practicality via a thorough empirical evaluation on data sets having billions of nodes and edges.
Our algorithm is simple enough for practical implementation and use, and to serve as the basis for further study and design of external-memory bisimulation algorithms. For example, we also develop in this paper specializations of our algorithm for computing common variations of bisimulation for tree-structured graphs in the form of XML documents. Furthermore, the complete implementation is open-source and available for download.
We proceed in the paper as follows. In the next section, we present basic definitions concerning our data model, bisimulation equivalence, and the standard external-memory computational model. We then present and theoretically analyze our external-memory bisimulation algorithm, and show how to specialize this general algorithm for various bisimulation notions proposed for XML data. Finally, we present a thorough empirical analysis of our approach, and conclude with a discussion of future directions for research.
In the context of this paper, a graph is a triple $G = (N, E, l)$, where $N$ is a finite set of nodes, $E \subseteq N \times N$ is a directed edge relation, and $l$ is a function with domain $N$ that assigns a label $l(n)$ to every node $n \in N$. With a slight abuse of terminology, we call $m$ a child of $n$, and $n$ a parent of $m$, if and only if $E$ contains an edge $(n, m)$. Let $E(n)$ be the set of all children of $n$, and let $E'(n)$ be the set of all parents of $n$. Note that in our work we only consider acyclic graphs. Furthermore, we assume that the node set $N$ is ordered in reverse topological order, that is, children always precede their parents in the order. Assuming a topological ordering is standard in the design of external-memory DAG algorithms [?]. Indeed, real-world data is often already ordered (e.g., XML documents), and, furthermore, practical approaches to topological sorting of massive data sets are available [?].
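To make the data model concrete, the following is a minimal in-memory sketch of a labeled graph whose nodes are stored in reverse topological order. The names `Node`, `Graph`, `is_reverse_topological`, and `parents_of` are hypothetical and chosen for illustration only; they are not taken from the implementation accompanying this paper, which works on external memory.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory node record: a label plus the list of children.
// A node is identified by its position in the node list.
struct Node {
    std::string label;
    std::vector<std::size_t> children;
};

using Graph = std::vector<Node>;

// Returns true if every child precedes its parent in the node list,
// i.e. the list is in reverse topological order. A graph that passes
// this check is necessarily acyclic (self-loops are rejected as well).
bool is_reverse_topological(const Graph& g) {
    for (std::size_t n = 0; n < g.size(); ++n) {
        for (std::size_t c : g[n].children) {
            if (c >= n) return false;
        }
    }
    return true;
}

// Derives the parent lists from the child lists.
// Assumes every child index is a valid position in g.
std::vector<std::vector<std::size_t>> parents_of(const Graph& g) {
    std::vector<std::vector<std::size_t>> parents(g.size());
    for (std::size_t n = 0; n < g.size(); ++n) {
        for (std::size_t c : g[n].children) {
            parents[c].push_back(n);
        }
    }
    return parents;
}
```

Storing children before parents means that a single forward scan over the node list sees every child before any node that refers to it, which is the property a bottom-up computation relies on.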

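Let $G_1 = (N_1, E_1, l_1)$ and $G_2 = (N_2, E_2, l_2)$ be two, possibly the same, graphs. Nodes $n_1 \in N_1$ and $n_2 \in N_2$ are bisimilar to each other, denoted $n_1 \approx n_2$, if and only if:

the nodes have the same label: $l_1(n_1) = l_2(n_2)$;

for every child $n_1' \in E_1(n_1)$ there is a child $n_2' \in E_2(n_2)$ such that $n_1' \approx n_2'$; and

for every child $n_2' \in E_2(n_2)$ there is a child $n_1' \in E_1(n_1)$ such that $n_1' \approx n_2'$.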
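On a DAG stored in reverse topological order, the three conditions above can be evaluated bottom-up: two nodes are bisimilar exactly when they have the same label and their children fall into the same set of bisimulation classes. The sketch below illustrates this for a single graph held in main memory. It is not the external-memory algorithm of this paper, and the names `LabeledNode` and `bisimulation_classes` are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node record; children are positions earlier in the list.
struct LabeledNode {
    std::string label;
    std::vector<std::size_t> children;
};

// Assigns a class identifier to every node such that two nodes receive
// the same identifier if and only if they are bisimilar.
// Assumes the nodes are given in reverse topological order.
std::vector<std::size_t> bisimulation_classes(
        const std::vector<LabeledNode>& nodes) {
    // A signature is the node label plus the set of classes of its children.
    using Signature = std::pair<std::string, std::set<std::size_t>>;
    std::map<Signature, std::size_t> class_of_signature;
    std::vector<std::size_t> class_of(nodes.size());

    for (std::size_t n = 0; n < nodes.size(); ++n) {
        std::set<std::size_t> child_classes;
        for (std::size_t c : nodes[n].children) {
            // c < n by assumption, so class_of[c] is already final.
            child_classes.insert(class_of[c]);
        }
        Signature sig{nodes[n].label, std::move(child_classes)};
        // Reuse the class of an equal signature, or open a new class.
        auto it = class_of_signature
                      .emplace(std::move(sig), class_of_signature.size())
                      .first;
        class_of[n] = it->second;
    }
    return class_of;
}
```

Using a set of child classes, rather than a list, reflects that the definition only asks whether a matching child exists, not how many matching children there are. The lookups in `class_of` and in the signature map are random accesses, which is why this formulation does not carry over directly to graphs that reside on disk.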
We can extend this notion to complete graphs as follows:
