I/O efficient bisimulation partitioning on very large directed acyclic graphs

July 4, 2019

In this paper we introduce the first efficient external-memory algorithm to compute the bisimilarity equivalence classes of a directed acyclic graph (DAG). DAGs are commonly used to model data in a wide variety of practical applications, ranging from XML documents and data provenance models, to web taxonomies and scientific workflows. In the study of efficient reasoning over massive graphs, the notion of node bisimilarity plays a central role. For example, grouping together bisimilar nodes in an XML data set is the first step in many sophisticated approaches to building indexing data structures for efficient XPath query evaluation. To date, however, only internal-memory bisimulation algorithms have been investigated. As the size of real-world DAG data sets often exceeds available main memory, storage in external memory becomes necessary. Hence, there is a practical need for an efficient approach to computing bisimulation in external memory.

Our general algorithm has a worst-case IO-complexity of $O(\mathrm{Sort}(|N| + |E|))$, where $|N|$ and $|E|$ are the numbers of nodes and edges, resp., in the data graph and $\mathrm{Sort}(n)$ is the number of accesses to external memory needed to sort an input of size $n$. We also study specializations of this algorithm to common variations of bisimulation for tree-structured XML data sets. We empirically verify efficient performance of the algorithms on graphs and XML documents having billions of nodes and edges, and find that the algorithms can process such graphs efficiently even when very limited internal memory is available. The proposed algorithms are simple enough for practical implementation and use, and open the door for further study of external-memory bisimulation algorithms. To this end, the full open-source C++ implementation has been made freely available.



Jelle Hellings
Hasselt University

George H.L. Fletcher
Eindhoven University of Technology
The Netherlands

Herman Haverkort
Eindhoven University of Technology
The Netherlands


Data modeled as directed acyclic graphs (DAGs) arise in a diversity of practical applications such as biological and biomedical ontologies [?], web folksonomies [?], scientific workflows [?], semantic web schemas [?], business process modeling [?, ?], data provenance modeling [?, ?], and the widely adopted XML standard [?]. It is anticipated that the variety, uses, and quantity of DAG-structured data sets will only continue to grow in the future.

In each of these application areas, efficient searching and querying on the data is a basic challenge. In reasoning over massive data sets, typically index data structures are computed and maintained to accelerate processing. These indexes are essentially a reduction or summary of the underlying data. Efficiency is achieved by performing reasoning over this reduction to the extent possible, rather than directly over the original data.

Many approaches to indexing have been investigated in preceding decades. Reductions of data sets typically group together data elements based on their shared values or substructures in the data. In graphs, the notion of bisimulation equivalence of nodes has proven to be an effective means for indexing (e.g., [?, ?, ?, ?, ?, ?, ?]). Bisimulation, which is a fundamental notion arising in a surprising range of contexts [?], is based on the structural similarity of subgraphs. Intuitively, two nodes are bisimilar to each other if they cannot be distinguished from each other by the sequences of node labels that may appear on the paths that start from these nodes, as well as from each of the nodes on those paths. Grouping bisimilar nodes is known as bisimulation partitioning. Blocks of bisimilar nodes are then used as the basis for constructing indexing data structures supporting efficient search and querying over the data.

Efficient internal-memory solutions for computing bisimulation partitions have been investigated (e.g., [?, ?, ?]). To scale to real-world data sets such as those discussed above, it becomes necessary to consider DAGs resident in external memory. In considering algorithms for such data, the primary concern is to minimize disk IO operations due to the high cost involved, relative to main-memory operations, in performing reads and writes to disk.

Due to the random-access nature of internal-memory algorithms, the design of external-memory algorithms which minimize disk IO typically requires a significant departure from approaches taken for internal-memory solutions [?]. In particular, state-of-the-art internal-memory bisimulation algorithms cannot be directly adapted to IO-efficient external-memory algorithms due to their inherent random-access behaviour. While a study has been made on storing and querying bisimulation partitions on disk [?], to our knowledge no approach has been developed to date for efficiently computing bisimulation partitions in external memory.

Motivated by these observations, in this paper we give the first IO-efficient external-memory bisimulation algorithm for DAGs. Our algorithm has a worst-case IO-complexity of $O(\mathrm{Sort}(|N| + |E|))$, where $|N|$ and $|E|$ are the numbers of nodes and edges, resp., in the data graph and $\mathrm{Sort}(n)$ is the number of accesses to external memory needed to sort an input of size $n$. Efficiency is achieved by intelligent organization of the graph on disk, and by sophisticated processing of the graph using global and local reorganization and careful staging and use of local bisimulation information. We establish the theoretical efficiency of the algorithm, and demonstrate its practicality via a thorough empirical evaluation on data sets having billions of nodes and edges.
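For orientation, the sorting cost referred to above is usually stated in the standard external-memory model as follows. The symbols $M$ (number of items that fit in internal memory) and $B$ (number of items per disk block) do not appear in this excerpt and are assumed here only for illustration.

```latex
\[
  \mathrm{Sort}(n) \;=\; \Theta\!\left(\frac{n}{B}\,\log_{M/B}\frac{n}{B}\right)
\]
```

Under these assumptions, a bound expressed in terms of $\mathrm{Sort}$ means that the total number of block transfers is within a constant factor of what sorting an input of that size would cost.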

Our algorithm is simple enough for practical implementation and use, and can serve as the basis for further study and design of external-memory bisimulation algorithms. For example, we also develop in this paper specializations of our algorithm for computing common variations of bisimulation for tree-structured graphs in the form of XML documents. Furthermore, the complete implementation is open-source and available for download.

We proceed in the paper as follows. In the next section, we present basic definitions concerning our data model, bisimulation equivalence, and the standard external-memory computational model. We then present and theoretically analyze our external-memory bisimulation algorithm, and show how to specialize this general algorithm for various bisimulation notions proposed for XML data. Finally, we present a thorough empirical analysis of our approach, and conclude with a discussion of future directions for research.

In the context of this paper, a graph is a triple $G = (N, E, l)$, where $N$ is a finite set of nodes, $E \subseteq N \times N$ is a directed edge relation, and $l$ is a function with domain $N$ that assigns a label $l(n)$ to every node $n \in N$. With a slight abuse of terminology, we call $m$ a child of $n$, and $n$ a parent of $m$, if and only if $E$ contains an edge $(n, m)$. Let $E(n)$ be the set of all children of $n$, and let $E'(n)$ be the set of all parents of $n$. Note that in our work we only consider acyclic graphs. Furthermore, we assume that the node set $N$ is ordered in reverse topological order, that is, children always precede their parents in the order. Assuming a topological ordering is standard in the design of external-memory DAG algorithms [?]. Indeed, real-world data is often already ordered (e.g., XML documents), and, furthermore, practical approaches to topological sorting of massive data sets are available [?].
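As a small illustration of this data model, the following sketch stores a node-labeled graph in memory and checks the reverse-topological-order assumption. The names `Graph`, `children`, `parents`, and `is_reverse_topological` are hypothetical and chosen for this example only. Nodes are assumed to be numbered 0, 1, 2, ... in the order in which they are stored.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Node-labeled directed graph; node i is the i-th node in storage order."""
    labels: list                               # labels[n] is the label of node n
    edges: set = field(default_factory=set)    # set of (parent, child) pairs

    def children(self, n):
        """All nodes m such that the edge relation contains (n, m)."""
        return {m for (p, m) in self.edges if p == n}

    def parents(self, n):
        """All nodes p such that the edge relation contains (p, n)."""
        return {p for (p, m) in self.edges if m == n}


def is_reverse_topological(g):
    """True if every child precedes each of its parents in the node order."""
    return all(child < parent for (parent, child) in g.edges)


# Hypothetical example: node 2 (label "a") has the two leaves 0 and 1 as children.
g = Graph(labels=["b", "b", "a"], edges={(2, 0), (2, 1)})
print(g.children(2), g.parents(0), is_reverse_topological(g))
```

If every edge points from a later node to an earlier one, the graph cannot contain a cycle, so this check also confirms acyclicity for the given numbering.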

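  • Let $G_1 = (N_1, E_1, l_1)$ and $G_2 = (N_2, E_2, l_2)$ be two, possibly the same, graphs. Nodes $n_1 \in N_1$ and $n_2 \in N_2$ are bisimilar to each other, denoted $n_1 \approx n_2$, if and only if:

    1. the nodes have the same label: $l_1(n_1) = l_2(n_2)$;

    2. for every child $n_1' \in E_1(n_1)$ of $n_1$ there is a child $n_2' \in E_2(n_2)$ of $n_2$ such that $n_1' \approx n_2'$; and

    3. for every child $n_2' \in E_2(n_2)$ of $n_2$ there is a child $n_1' \in E_1(n_1)$ of $n_1$ such that $n_1' \approx n_2'$.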
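To make the three conditions concrete, here is a minimal in-memory sketch that partitions the nodes of a single DAG into classes of bisimilar nodes. It is not the external-memory algorithm of this paper. It relies on the reverse topological order assumed earlier: when a node is processed, all of its children already have a block identifier, so two nodes are bisimilar exactly when they have the same label and the same set of child block identifiers. The function name `bisimulation_blocks` and the input encoding (a list of labels and a list of child lists) are assumptions made for this example.

```python
def bisimulation_blocks(labels, children):
    """Assign a block id to every node of a DAG so that two nodes receive
    the same id exactly when they are bisimilar.

    labels[n]   -- label of node n
    children[n] -- iterable with the children of node n
    Nodes must be numbered in reverse topological order (children first);
    otherwise the lookup of a child's block raises IndexError.
    """
    block_of = []              # block_of[n] is the block id of node n
    signature_to_block = {}    # (label, child block ids) -> block id
    for n, label in enumerate(labels):
        child_blocks = frozenset(block_of[c] for c in children[n])
        signature = (label, child_blocks)
        # Reuse the block of an equal signature, or open a new block.
        block = signature_to_block.setdefault(signature, len(signature_to_block))
        block_of.append(block)
    return block_of


# Hypothetical example: nodes 2 and 3 are both labeled "a" and all of their
# children are bisimilar leaves, so they end up in the same block.
labels = ["b", "b", "a", "a", "r"]
children = [[], [], [0], [0, 1], [2, 3]]
print(bisimulation_blocks(labels, children))  # prints [0, 0, 1, 1, 2]
```

Note that nodes 2 and 3 have a different number of children yet are still bisimilar, because conditions 2 and 3 only compare which classes of children occur, not how many children fall into each class.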


    We can extend this notion to complete graphs as follows:
