Dolha - an Efficient and Exact Data Structure for Streaming Graphs

Abstract

A streaming graph is a graph formed by a sequence of incoming edges with time stamps. Unlike a static graph, a streaming graph is highly dynamic and time related. In the real world, high-volume, high-velocity streaming graphs such as internet traffic data, social network communication data and financial transfer data pose challenges to classic graph data structures. We present a new data structure, the double orthogonal list in hash table (Dolha), which is a high-speed and memory-efficient graph structure applicable to streaming graphs. Dolha has constant time cost per edge and near-linear space cost, so it can hold billions of edges in memory and process an incoming edge in nanoseconds. Dolha also has linear time cost for neighborhood queries, which allows it to support most graph algorithms without extra cost. We also present a persistent structure based on Dolha that can handle sliding window updates and time-related queries.


Fan Zhang
Peking University
Beijing, China, 100080
zhangfanau@pku.edu.cn
Lei Zou
Peking University
Beijing, China, 100080
zoulei@pku.edu.cn
Li Zeng
Peking University
Beijing, China, 100080
li.zeng@pku.edu.cn
Xiangyang Gou
Peking University
Beijing, China, 100080
gxy1995@pku.edu.cn


In the real world, billions of relations and communications are created every day. A large ISP must handle a huge volume of network traffic packets per hour per router [?]; 100 million users log on to Twitter and post around 500 million tweets per day [?]; worldwide, more than 200 billion emails are sent and received per day [?]. These relations come and fade away like the tides, and mining knowledge from such highly dynamic graph data is as difficult as capturing one particular wave in the sea. To handle this situation, we need a graph data structure that is memory efficient enough to contain the enormous amount of data and fast enough to seize every nanosecond of the stream.

There has been prior work on streaming graph summarization, such as TCM [?], and on specific queries, such as TRIÈST [?]. However, there are some complicated situations that existing work does not cover. To illustrate the problem addressed in this paper, we first give some motivating examples:

Use Case 1: Network traffic. Network traffic is a typical kind of streaming graph. Each IP address is one vertex and the communication between two IPs is an edge. As data packets are sent and received, the graph changes rapidly. To monitor this network, we need to run queries on the streaming graph. For example, to detect suspected cyber-attacks, we want to know how many data packets each IP sends or receives and how many KB of data each edge carries. These problems are defined as the vertex query and the edge query and can be solved by a graph summarization system [?] in constant time. However, if we need more structure-aware query answers, such as “who are the receivers of a given IP?”, “who are the 2-hop neighbors of this IP?” and “how many IPs can this IP reach?”, the existing graph summarization techniques (such as TCM [?]) cannot provide accurate answers. In some applications, an exact data structure is preferable to a probabilistic data structure for streaming graphs.

Use Case 2: Social network. In a social network graph, a user is considered as one vertex and the user's relations are the edges from this vertex. One of the most common queries is triangle counting, and there are many algorithms for this problem. But existing solutions are designed specifically for triangle counting [?], and the same holds for some continuous subgraph matching systems [?] and cycle detection systems [?] over streaming graphs. If we want to run different kinds of dynamic graph analysis, we have to maintain multiple streaming systems, which is costly in both space and time. A more elegant solution is one uniform system that supports most graph analysis algorithms on streaming graphs.

Usually, an edge in a streaming graph is received with a time stamp indicating its arrival time. Some applications need to recover historical information or apply time constraints based on these time stamps, but few systems support such time-related graph queries. Here are two examples:

Use Case 3: Financial transaction. A bank has a streaming graph system to monitor the money transactions of the last seven days. Each customer is recorded as one vertex, and each money transfer is recorded as one edge. On Friday, the bank receives a notice from another branch that a few suspicious transfers were made from this bank on Tuesday between 10am and 4pm. To find these suspicious transfers, the bank needs to run pattern matching on the time-constrained transfers. In this case, we need a streaming graph system that supports not only queries on the latest snapshot but also time-related queries over historical information.

Use Case 4: Fraud detection. The same bank from Use Case 3 receives another report from the police. The report has a list of suspicious accounts that may be involved in credit card fraud and the money transfer pattern they use. The bank needs to find when such a pattern appeared in the transaction record among those accounts. The figure below shows an example of a credit card fraud pattern. In this case, we have the bank's account ID, the merchant account ID and a list of suspicious account IDs that had transactions with the merchant account. Considering these IDs as vertices and the transactions as edges, we can construct a set of query graphs. We need to locate the times at which these query graphs appear in the streaming transaction graph, then check the inward and outward neighbors of the suspicious accounts near those times to find other members of the criminal group.

Figure: Credit card fraud in transactions (taken from [?])

Motivated by the above use cases, an efficient streaming graph structure should satisfy the following requirements:

  • To enable efficient graph computing, the space cost of the data structure should be small enough to fit into main memory;

  • Given the enormous amount of data and the high-frequency updates, the data structure must process one incoming edge in constant time;

  • The data structure should support many kinds of graph algorithms rather than designed for one specific graph algorithm;

  • The data structure should also support time-related queries for historical information.

In the literature, there exist some streaming graph data structures. Generally speaking, they fall into two categories: general streaming graph data structures and data structures designed for specific graph algorithms. General streaming graph structures are designed to preserve the whole structure of the streaming graph, so they can support most graph algorithms, such as BFS, DFS, reachability query and subgraph matching, by using neighbor search primitives. Most structures of this kind are based on a hash map combined with a classical graph data structure such as the adjacency matrix or the adjacency list. The GraphStream Project [?] is based on an adjacency list associated with a hash map. The basic idea of this structure is to map the vertex IDs into a hash table. Each cell of the vertex hash table stores the vertex ID and the incoming and outgoing links. TCM [?] and gMatrix [?] propose to combine a hash map with an adjacency matrix. Different from [?], TCM and gMatrix are approximate data structures whose query results contain errors caused by hash conflicts. There are also other streaming graph data structures that support a single specific graph algorithm, such as HyperANF [?] for t-hop neighbor and distance queries, the Single-Sink DAG [?] for pattern matching and TRIÈST [?] for triangle counting.

The table below lists the space cost of different general streaming graph data structures together with the time complexity of edge insertion and of edge and 1-hop neighbor queries. GraphStream's edge insertion time depends on the maximum vertex degree. In many scale-free networks, the maximum vertex degree is very large, so GraphStream is not suitable for high-speed streaming graphs. TCM and gMatrix have quadratic space cost, which prevents them from being used on large graphs. In contrast, our proposed approach (called Dolha) meets all the requirements for streaming graphs. In general, Dolha is the combination of the orthogonal list with hashing techniques. The orthogonal list builds two singly linked lists, one of outgoing edges and one of incoming edges, for each vertex and stores the first item of each list in the vertex cell. The hash table, in turn, is commonly used in streaming data structures such as the Bloom filter [?] and the count-min sketch [?] to achieve amortized constant-time lookup. The combination of orthogonal list and hash table is a promising option to achieve our goal. Based on this idea, we present a new exact streaming graph structure: double orthogonal list in hash table (Dolha).

Base structure (+Hash)    Adjacency List     Adjacency Matrix   Orthogonal List
System                    GraphStream [?]    TCM [?]            Dolha
Space Cost
Time Cost per Edge
Edge Query
1-hop Neighbor Query
Table: General Streaming Graph Structures

Our Contributions: The table above compares the three general streaming graph structures. In this paper:

  1. We design an effective data structure (Dolha) for streaming graphs with near-linear space cost and constant time cost for a single edge operation. Compared with existing data structures, Dolha is more suitable for high-speed streaming graph data.

  2. The Dolha data structure can answer many kinds of queries over streaming graphs. In particular, Dolha supports the edge query primitive in constant time and 1-hop neighbor queries in linear time.

  3. We present a variant of Dolha, Dolha persistent, that supports sliding windows and time-related queries in linear time.

  4. Extensive experiments over both real and synthetic datasets confirm the superiority of Dolha over the state of the art.

Among the existing studies, we categorize the structures into two classes: general streaming graph structures and structures for specific streaming graph algorithms.

General streaming graph structures are designed to preserve the data of the graph stream and maintain the graph connection information at the same time. A general streaming graph structure can support most graph algorithms, such as BFS, DFS, reachability query and subgraph matching, by using neighbor search primitives. Most structures of this kind are based on a hash map combined with a basic graph data structure such as the adjacency matrix or the adjacency list. There are two different kinds of general streaming graph structures: exact structures and approximation structures.

Exact Structures: The GraphStream Project [?] is an exact graph stream processing system implemented in Java. It is based on an adjacency list associated with a hash map and supports most graph algorithms. The basic idea of this structure is to map the vertex IDs into a hash table. Each cell of the vertex hash table stores the vertex ID and the incoming and outgoing links.

An adjacency list needs space linear in the size of the graph, and a traversal takes time linear in the length of the lists visited. However, to locate an edge, we need to go through the neighbor lists of both its outgoing and incoming vertices, which is expensive in extreme situations where a vertex has a very large degree. Even if we keep each neighbor list sorted, each edge lookup still costs time that grows with the vertex degree.

The figure below shows an example of an adjacency list in a hash table. We use a hash function to map the vertices into the cells of the vertex hash table, and each cell has sorted lists to store the outgoing and incoming neighbors of the vertex. That is, the cell of a vertex stores the vertex ID, the outgoing list and the incoming list. The adjacency list stores the exact information of the graph stream, but each edge insertion costs time that depends on the vertex degree.

Figure: Example of adjacency list in hash table
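
The following is a minimal sketch of the adjacency-list-in-hash-table idea, not GraphStream's actual implementation. The class name `AdjacencyListGraph` and its methods are hypothetical; a Python dict plays the role of the vertex hash table, and each neighbor list is kept sorted. It illustrates why insertion cost grows with vertex degree: `list.insert` has to shift every later neighbor.

```python
import bisect


class AdjacencyListGraph:
    """Hypothetical exact structure: hash map from vertex ID to sorted neighbor lists."""

    def __init__(self):
        self.out = {}  # vertex ID -> sorted list of outgoing neighbors
        self.inc = {}  # vertex ID -> sorted list of incoming neighbors

    def add_edge(self, u, v):
        # Insert v into u's outgoing list and u into v's incoming list.
        for table, key, neighbor in ((self.out, u, v), (self.inc, v, u)):
            neighbors = table.setdefault(key, [])
            i = bisect.bisect_left(neighbors, neighbor)  # binary search
            if i == len(neighbors) or neighbors[i] != neighbor:
                neighbors.insert(i, neighbor)  # shifting makes this degree-dependent

    def has_edge(self, u, v):
        neighbors = self.out.get(u, [])
        i = bisect.bisect_left(neighbors, v)
        return i < len(neighbors) and neighbors[i] == v
```

The lookup itself is a binary search, but keeping the list sorted on every insertion is what becomes costly for high-degree vertices in a fast stream.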

Approximation Structures: Another solution for the structure of a streaming graph is an adjacency matrix in a hash table. We can hash the vertices into a hash table and use a pair of vertex indexes as coordinates to construct an adjacency matrix. A vertex query in the hash table takes constant time, and so does an edge lookup in the matrix. In terms of time cost, an adjacency matrix in a hash table is efficient, but its space cost is a drawback. In the real world, graphs are usually sparse, and we cannot afford on the order of a trillion ($10^{12}$) matrix cells for a graph with a million vertices. A compromise is to compress the vertices into a much smaller hash table to reduce the space cost. But with a high compression ratio, the structure is only suitable for graph summarization systems such as TCM [?] and gMatrix [?].

The figure below shows an example of an adjacency matrix in a hash table. We use a hash function to map the vertices into the cells of a hash table and use the table indexes to build a matrix. In the cells of the matrix, we store the weights of edges. When two vertices share a hash value, the matrix cell for one edge also represents every other edge whose endpoints hash to the same row and column. As a result, an outgoing neighbor query for a vertex may return vertices that are not its real neighbors. To get exact results, the matrix needs one row and one column per vertex, so its size is quadratic in the number of vertices, which is much larger than the number of edges.

Figure: Example of adjacency matrix in hash table
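
To make the collision problem concrete, here is a small sketch of a compressed adjacency matrix in a hash table. It is an illustration only, not the TCM or gMatrix implementation: the class `HashedAdjacencyMatrix`, the modulo hash and the integer vertex IDs are all assumptions chosen so that the collision is easy to see.

```python
class HashedAdjacencyMatrix:
    """Hypothetical approximate structure: vertices with equal hashes share a row and column."""

    def __init__(self, size):
        self.size = size
        self.cells = [[0] * size for _ in range(size)]  # edge weights
        self.seen = set()                               # vertex IDs observed so far

    def _h(self, vertex_id):
        return vertex_id % self.size  # assumed toy hash function

    def add_edge(self, u, v, w=1):
        self.cells[self._h(u)][self._h(v)] += w
        self.seen.update((u, v))

    def successors(self, u):
        # Every seen vertex whose column is non-zero in u's row is reported,
        # including vertices that only collide with a real neighbor.
        row = self.cells[self._h(u)]
        return {v for v in self.seen if row[self._h(v)] > 0}


m = HashedAdjacencyMatrix(size=2)
m.add_edge(1, 2)
m.add_edge(3, 4)  # 3 collides with 1, and 4 collides with 2
```

Here `m.successors(1)` reports both 2 and 4, although the only real successor of vertex 1 is 2. This is the kind of error an exact structure has to avoid.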

Unlike the general structures, some data structures are designed for specific algorithms on graph streams. For example, HyperANF [?] is an approximation system for t-hop neighbor and distance queries; the Single-Sink DAG [?] is for pattern matching on large dynamic graphs; TRIÈST [?] is a sampling system for triangle counting in streaming graphs; and there are some connectivity and spanner structures described in the graph stream survey [?]. These systems support only the algorithms they were designed for and are unusable or impractical for other graph queries.

Time constrained continuous subgraph search over streaming graphs [?] is a rare and recent work that considers time as a query parameter. It proposes a kind of query that requires not only structure matching but also time order matching. The figure below shows an example of a time constrained subgraph query. In this query, each edge of the query graph has a time-stamp constraint. A matching subgraph is a subgraph that is isomorphic to the query graph and whose time stamps follow the given order.

Figure: Running example query, with (a) the query graph and (b) the timing order (taken from [?])

Definition 1 (Streaming Graph)

A streaming graph is a directed graph formed by a continuous and time-evolving sequence of edges. Each edge goes from a source vertex to a destination vertex and arrives at a time stamp with a weight.

Figure: Streaming Graph

Generally, there are two models of streaming graphs in the literature. One cares only about the latest snapshot structure, where the latest snapshot is the superposition of all edges that have arrived up to the latest time point. The other model records the historical information of the streaming graph. The two models are formally defined in Definitions 2 and 4, respectively. In this paper, we propose a uniform data structure (called Dolha) to support both of them.

Definition 2 (Snapshot & Latest Snapshot Structure)

An edge may appear multiple times with different weights at different time stamps. The total weight of an edge at a snapshot is the sum of the weights of all its occurrences in the streaming graph before (and including) the time point of the snapshot.

For a streaming graph, the corresponding snapshot at a time point is the set of edges that have positive total weight at that time.

When the time point is the current time, the snapshot is the latest snapshot structure of the streaming graph.

Figure: Snapshots of the streaming graph at successive time points, panels (a) to (f)

An example of a streaming graph is shown in the figure captioned “Streaming Graph”, and the snapshot figure shows its snapshots at successive time points. In panel (c), the total weight of an edge is updated when a second occurrence of the edge arrives. In panel (d), an edge receives a negative weight update. Since the total weight of this edge is no longer positive after the update, it is deleted from the snapshot at that time. In panel (e), the deletion of an edge causes the deletion of a vertex that no longer has any edges, and the vertex is added back in a later snapshot when a new edge incident to it arrives.
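
As a sketch of Definition 2, the function below recomputes a snapshot directly from the edge sequence. The function name `snapshot`, the tuple layout `(source, destination, time stamp, weight)` and the sample stream are assumptions made for illustration; Dolha maintains the same result incrementally instead of rescanning the stream.

```python
from collections import defaultdict


def snapshot(stream, t):
    """Return {edge: total weight} for edges with positive total weight at time t."""
    total = defaultdict(int)
    for u, v, ts, w in stream:
        if ts <= t:                      # occurrences before and including t
            total[(u, v)] += w
    return {edge: w for edge, w in total.items() if w > 0}


# Assumed toy stream: (source, destination, time stamp, weight)
stream = [("a", "b", 1, 1), ("a", "c", 2, 1), ("a", "b", 3, 2), ("a", "c", 4, -1)]
```

With this stream, the snapshot at time 3 contains both edges, while at time 4 the negative update brings the total weight of the edge from a to c down to zero, so that edge drops out of the snapshot.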

In some applications, we need to record the historical information of streaming graphs, as in the fraud detection example (Use Case 4) above. Thus, we also consider the sliding window-based model.

Definition 3 (Sliding Window)

Let the streaming graph have a starting time and let the window have a fixed length. In every update, the window slides forward by a fixed step. Each sliding window contains all edges whose time stamps fall inside it.

In the sliding window figure below, the window has a fixed size and slides by a fixed step in each update. The figure illustrates the first and the second sliding windows, where the left-most three edges have expired in the second window.

Figure: Sliding window update on streaming graph

Definition 4 (Window Based Persistent Structure)

Given a streaming graph , the Window Based Persistent Structure (“persistent structure” for short) is a graph formed by all the unexpired edges in the current time window. Each edge is associated with the time stamps denoting the arriving times of the edge. An edge may have multiple time stamps due to the multiple occurrences.

Figure: Window based persistent structure

In a snapshot streaming graph structure, only the latest snapshot is recorded and the historical information is overwritten. For example, a snapshot structure stores only the snapshot at the last time point, shown in panel (f) of the snapshot figure. The update history of the streaming graph is lost.

Assume that the second time window (Window 1) is the current window. The figure captioned “Window based persistent structure” shows how the persistent structure stores the streaming graph. In the example, one edge is associated with three time points that are all in the current time window. Although an edge also occurs at an earlier time, that occurrence has expired in this time window. The gray edges denote all expired edges.
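
The persistent structure can be sketched as a map from each edge to the time stamps of its unexpired occurrences. This is a simplified illustration, not the Dolha persistent layout: the class `PersistentWindow`, the half-open window convention (an occurrence expires once its time stamp is at most `now - length`) and the assumption that time stamps arrive in non-decreasing order are all choices made here.

```python
from collections import defaultdict, deque


class PersistentWindow:
    """Hypothetical window-based persistent structure keeping per-edge time stamps."""

    def __init__(self, length):
        self.length = length
        self.stamps = defaultdict(deque)  # (u, v) -> time stamps, oldest first

    def add(self, u, v, ts):
        self.stamps[(u, v)].append(ts)

    def expire(self, now):
        # Drop occurrences that fell out of the window (now - length, now].
        cutoff = now - self.length
        for edge in list(self.stamps):
            occurrences = self.stamps[edge]
            while occurrences and occurrences[0] <= cutoff:
                occurrences.popleft()
            if not occurrences:
                del self.stamps[edge]     # edge has no unexpired occurrence left

    def edge_query(self, u, v):
        return list(self.stamps.get((u, v), ()))
```

For example, after adding an edge at times 1, 4 and 7 and calling `expire(7)` with a window length of 5, an edge query returns the two unexpired time stamps 4 and 7.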

Definition 5 (Streaming graph query primitives)

We define four query primitives for streaming graphs. Most graph algorithms, such as DFS, BFS, reachability query and subgraph matching, are based on these query primitives:

  1. Edge Query: Given a pair of vertex IDs, return the weight or time stamp of the edge between them. If the edge does not exist, return null.

  2. Vertex Query: Given a vertex ID, return the incoming or outgoing weight of the vertex. If the vertex does not exist, return null.

  3. 1-hop Successor Query: Given a vertex ID, return the set of vertices that the vertex can reach in one hop. If there is no such vertex, return null.

  4. 1-hop Precursor Query: Given a vertex ID, return the set of vertices that can reach the vertex in one hop. If there is no such vertex, return null.
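
The four primitives can be sketched over any structure that stores both edge directions. The class below is a hypothetical dict-based stand-in used only to pin down the expected inputs and outputs; the name `SnapshotGraph`, the method names and the use of `None` for the null result are assumptions, and the layout is not Dolha's.

```python
class SnapshotGraph:
    """Hypothetical snapshot graph exposing the four query primitives."""

    def __init__(self):
        self.out = {}  # u -> {v: weight}
        self.inc = {}  # v -> {u: weight}

    def add_edge(self, u, v, w):
        for vertex in (u, v):             # make sure both vertices are known
            self.out.setdefault(vertex, {})
            self.inc.setdefault(vertex, {})
        self.out[u][v] = w
        self.inc[v][u] = w

    def edge_query(self, u, v):
        return self.out.get(u, {}).get(v)  # None if the edge does not exist

    def vertex_query(self, v):
        if v not in self.out:
            return None
        return sum(self.out[v].values()), sum(self.inc[v].values())

    def successors(self, v):
        return set(self.out.get(v, {})) or None

    def precursors(self, v):
        return set(self.inc.get(v, {})) or None
```

A BFS or reachability query, for instance, only needs repeated calls to `successors`, which is why a structure that answers these primitives efficiently can host most graph algorithms.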

The query primitives differ slightly between the two structures. If we query an edge in the snapshot structure, the result is the last updated information of that edge. If we query the same edge in the persistent structure shown in the figure captioned “Window based persistent structure”, the result is a list of all unexpired occurrences of the edge. The same difference applies to the 1-hop successor and precursor queries. If we query the successors of a vertex, the snapshot structure returns each successor once, whereas the persistent structure returns one answer for every unexpired edge occurrence.

Based on the persistent structure query primitives, we define a new type of query on streaming graphs, named time related query, that takes time stamps as query parameters. In this paper, we adopt two kinds of time related queries: a time constrained pattern query finds the matching subgraphs in a given time period; a structure constrained time query finds the time periods in which a given subgraph appears in the streaming graph.

Definition 6 (Time Constrained Pattern Query)

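A pattern graph is a triple consisting of a set of vertices, a set of directed edges, and a function that assigns a label to each vertex. Given a pattern graph and a time period, a subgraph of the streaming graph is a time constrained pattern match of the pattern graph if and only if there exists a bijective function from the pattern vertices to the subgraph vertices such that the following conditions hold:

  1. Structure Constraint (Isomorphism)

    • Each pattern vertex has the same label as the vertex it is mapped to.

    • Every pattern edge is mapped to an edge of the subgraph between the images of its two endpoints.

  2. Time Period Constraint

    • The time stamp of every matched edge lies within the given time period.

In this paper, the problem is to find all the time constrained pattern matches of a given pattern graph over the snapshot of the streaming graph at a given time.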

Figure: Time constrained pattern query, with (a) the pattern graph and (b) a pattern match

The figure above shows an example of a time constrained pattern query. In panel (a), a pattern graph is given that queries all 2-hop connected structures. The edges of the pattern graph carry a time constraint, so only edges whose time stamps fall within the given period are considered as match candidates. Panel (b) is the snapshot of the streaming graph at the query time. Edges whose time stamps fall outside the time constraint are discarded, and a single edge set remains as the only matching subgraph for the given pattern.
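
For the 2-hop pattern in this example, a time constrained pattern query can be sketched as a filter on time stamps followed by a join on the middle vertex. The function `two_hop_matches`, the edge tuples and the sample data are assumptions for illustration, and vertex labels are ignored for simplicity. A general pattern would need a subgraph matching algorithm on top of the successor primitive.

```python
def two_hop_matches(edges, t_start, t_end):
    """Return 2-hop paths whose two edges both have time stamps in [t_start, t_end].

    edges: iterable of (source, destination, time stamp).
    """
    valid = [(u, v) for u, v, ts in edges if t_start <= ts <= t_end]
    by_source = {}
    for u, v in valid:
        by_source.setdefault(u, []).append(v)
    return [((u, v), (v, x))
            for u, v in valid
            for x in by_source.get(v, [])
            if len({u, v, x}) == 3]       # the mapping must be bijective


# Assumed toy snapshot: only the two middle edges fall inside the period [4, 10].
edges = [("a", "b", 1), ("b", "c", 5), ("c", "d", 6), ("d", "e", 12)]
matches = two_hop_matches(edges, 4, 10)
```

In this toy data the first and last edges are discarded by the time constraint, which leaves the path from b through c to d as the only match.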

Definition 7 (Structure Constrained Time Query)

A query graph is a sequence of directed edges, and the query result is a set of time pairs, each marking the start and end of a time period. Given a query graph, a structure constrained time match is a set of time periods such that the query graph is a subgraph of every snapshot of the streaming graph during each of those periods.

Figure: Structure constrained time query

The figure above gives an example of a structure constrained time query for a given edge set. The query graph is a subgraph of every snapshot from the time its last edge arrives until one of its edges is deleted. The query graph matches again once that edge arrives anew. The query result is the set of these matching time periods.
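
A structure constrained time query can be sketched by replaying the stream and tracking when every query edge has positive total weight. The function `matching_periods`, the tuple layout and the convention that an open period ends with `None` are assumptions made here; the sketch scans the whole stream, whereas a persistent structure would answer from stored time stamps.

```python
def matching_periods(stream, query_edges):
    """Return [(start, end), ...] periods during which all query edges are present.

    stream: iterable of (source, destination, time stamp, weight).
    A period that is still open at the end of the stream has end = None.
    """
    weight = {edge: 0 for edge in query_edges}
    periods, start = [], None
    for u, v, ts, w in sorted(stream, key=lambda item: item[2]):
        if (u, v) not in weight:
            continue                       # edge is not part of the query graph
        weight[(u, v)] += w
        present = all(x > 0 for x in weight.values())
        if present and start is None:
            start = ts                     # query graph starts matching
        elif not present and start is not None:
            periods.append((start, ts))    # a query edge was deleted at ts
            start = None
    if start is not None:
        periods.append((start, None))
    return periods


stream = [("a", "b", 1, 1), ("b", "c", 2, 1), ("a", "b", 5, -1), ("a", "b", 8, 1)]
periods = matching_periods(stream, {("a", "b"), ("b", "c")})
```

In this toy stream the query graph matches from time 2 until the deletion at time 5, and again from time 8 onward.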

Notation Definition and Description
/ Streaming graph / Snapshot at time point
/ Dolha snapshot / Dolha persistent
The directed edge from one vertex to another
Doll Double orthogonal linked list
Outgoing Doll
Incoming Doll
Time travel linked list
Edge weight
Edge time stamp
Hash value of
Vertex table index of
Edge table index of
First item’s edge table index of link
Last item’s edge table index of link
Last item’s edge table index of link
Next item’s edge table index of link
Previous item’s edge table index of link
Previous/next item of
Table: Notations

In order to handle high-speed streaming graph data, we propose a data structure called Double Orthogonal List in Hash Table (Dolha for short). Essentially, Dolha is the combination of a double orthogonal linked list with hash tables. A double orthogonal linked list (Doll for short) is a classical data structure to store a graph, in which each edge is both in the doubly linked list of all the outgoing edges from its source vertex, denoted as the outgoing Doll, and in the doubly linked list of all the incoming edges to its destination vertex, denoted as the incoming Doll. The source vertex has two pointers to the first item and the last item of its outgoing Doll. The destination vertex has two pointers to the first item and the last item of its incoming Doll. The figure below illustrates an example of a Doll.

Figure: Example of Doll

Given a graph, the Dolha structure consists of four key-value tables. We assume that each vertex and each edge is hashed to a hash value; in the running example, a single hash function maps both the vertices and the edges.

Vertex Hash Table: Dolha creates a vertex hash table of a fixed size and uses the hash function to map each vertex ID to a vertex hash table index. Due to hash collisions, there can be a list of vertices with the same hash table index. In each table cell, Dolha stores the vertex table index of the first vertex on the collision list.

The vertex hash table shown in the tables below is an example. We use the hash value of a vertex as the hash index to locate its vertex table index, and find the details of the vertex in the corresponding vertex table cell. When another vertex has the same hash value, a hash collision occurs. We use the hash value to find the first vertex on the collision list, and then find the vertex table index of the next item in that vertex's vertex table cell.

Vertex Table: Dolha creates a vertex table of a fixed size and one empty cell variable that holds the index of the next empty cell. Initially, the empty cell variable points to the first cell. A newly arriving vertex is given the current value of the empty cell variable as its vertex table index, and the variable is then increased by 1. In each vertex table cell, Dolha stores the vertex ID, the outgoing weight sum and incoming weight sum, the head and tail edge table indexes of the outgoing Doll, the head and tail edge table indexes of the incoming Doll, and the vertex table index of the next vertex on the collision list.

The vertex table below shows the vertex table of the example streaming graph. Out/In indicates the outgoing and incoming weights of the vertex. Row O holds the edge table indexes of the first and last items of the outgoing Doll, and row I holds the edge table indexes of the first and last items of the incoming Doll. Row H holds the next vertex on the collision list. The vertices are given indexes incrementally in order of first arrival time. When the empty cell variable reaches the end of the table, the vertex table is full. If more vertices arrive, we can create a new vertex table whose indexes continue from the last index, as an extension of the existing vertex table.

Edge Hash Table: Dolha creates an edge hash table of a fixed size and uses the hash function to map the outgoing vertex ID plus the incoming vertex ID of an edge to an edge hash table index. As in the vertex hash table, Dolha stores the edge table index of the first edge on the collision list.

The edge hash table handles hash collisions in the same way as the vertex hash table. When two edges have the same hash value, the hash table cell gives the edge table index of the first edge, and the edge table cell of that edge gives the edge table index of the next one.

Edge Table: Dolha creates an edge table of a fixed size and one empty cell flag that holds the index of the next empty cell. Initially, the flag points to the first cell. A newly arriving edge is given the current value of the flag as its edge table index, and the flag is then increased by 1. In each edge table cell, Dolha stores the vertex table indexes of the two endpoint vertices, the weight, the time stamp, the previous and next edge table indexes for the outgoing Doll, the previous and next edge table indexes for the incoming Doll, and the edge table index of the next edge on the collision list.

The edge table below shows the edge table of the example streaming graph. Each cell holds the weight and the time stamp of the edge. Vertex index indicates the outgoing and incoming vertices of the edge. Row O holds the edge table indexes of the next and previous items of the outgoing Doll, and row I holds the edge table indexes of the next and previous items of the incoming Doll. Row H holds the next edge on the collision list.

Hash index 0 1 2 3 4
Vertex table index
Table: Vertex hash table of the example streaming graph

Index 0 1 2 3 4
Vertex ID
Out/In
O
I
H
Table: Vertex table of the example streaming graph

Hash index 0 1 2 3 4 5
Edge table index
Table: Edge hash table of the example streaming graph

Index 0 1 2 3 4 5
Weight 1 1 1 1 1 /
Time stamp 1 2 3 4 5 /
Vertex index
O
I
H
Table: Edge table of the example streaming graph

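When an edge arrives:

  • Map the edge into its edge hash table cell.

  • If the cell is empty, the edge does not exist in the structure. If the cell is not empty, traverse the collision list of the cell in the edge hash table. If the edge is found, it exists; if not, it does not exist.

There are two possible operations:

If the edge does not exist:

  • Add the edge into an empty edge table cell and into the collision list of its edge hash table cell.

  • Map the two vertices of the edge into their vertex hash table cells.

  • If the cell of the source vertex is empty, add the vertex ID into an empty vertex table cell. If the cell is not empty, traverse the collision list of the cell in the vertex hash table. If a matching ID is found, update the vertex table cell of that vertex; if not, add the vertex into an empty vertex table cell and into the collision list.

  • Do the same operation for the destination vertex.

  • Add the edge to the end of the outgoing Doll of the source vertex and the incoming Doll of the destination vertex.

If the edge exists:

  • Add the incoming weight to the weight of the edge and set its time stamp to the incoming time stamp.

  • Delete the edge from the outgoing Doll of the source vertex and the incoming Doll of the destination vertex.

  • If the edge has positive weight after this update:

    • Add the edge to the end of the outgoing and incoming Dolls.

  • If the edge has zero or negative weight after this update:

    • Delete the edge from the edge table.

    • If the source or destination vertex has no item left in both its outgoing and incoming Dolls, delete that vertex.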

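Input: Streaming graph
Output: Dolha snapshot structure of the streaming graph
1 for each incoming edge e of the streaming graph do
2       Check existence of e:
3       Map e into its edge hash table cell.
4       if the cell is empty then
5             e does not exist
6            
7      else
8             Traverse the collision list from the first edge in the cell.
9             if the end of the list is reached with no match for e then
10                   e does not exist
11                  
12            else
13                   e exists
14                  
15      if e does not exist then
16             Update the collision list of e:
17             if the edge hash table cell is empty then
18                   Let the cell point to the new edge table index
19                  
20            else
21                   Let the last edge on the collision list point to the new edge table index
22                  
23            Check existence of the source vertex:
24             Map the source vertex into its vertex hash table cell
25             if the cell is empty then
26                   Add the vertex into the vertex table and let the cell point to its vertex table index
27                  
28            else
29                   Traverse the collision list from the first vertex in the cell.
30                   if the end of the list is reached with no match for the vertex then
31                         Add the vertex into the vertex table and let the last vertex on the collision list point to it
32                        
33            Do the same operation for the destination vertex
34             Add e into the edge table
35             Add e into the outgoing Doll:
36             if both the head and the tail of the Doll are empty then
37                   Let the head and the tail be e
38                  
39            if neither the head nor the tail is empty then
40                   Let the previous item of e be the old tail, the next item of the old tail be e, and the tail be e
41                  
42            Add e into the incoming Doll in the same way as the outgoing Doll
43            
44      if e exists then
45             Let the weight of e be increased by the incoming weight and its time stamp be the incoming time stamp
46             Delete e from the outgoing Doll:
47             if e is the first item of the outgoing Doll then
48                   Let the head be the next item of e and clear the previous link of that item, if any
49                  
50            if e is the last item of the outgoing Doll then
51                   Let the tail be the previous item of e and clear the next link of that item, if any
52                  
53            if e is neither the first nor the last item then
54                   Let the previous item of e link to the next item of e, and the next item link back to the previous item
55                  
56            Delete e from the incoming Doll in the same way as the outgoing Doll
57             if the updated weight of e is positive then
58                   Add e to the end of the outgoing Doll and the incoming Doll
59                  
60            else
61                   Delete e from the collision list of its edge hash table cell
62                   Delete e from the edge table and flag its cell as an empty cell
63                   if there is no item on both the outgoing Doll and the incoming Doll of the source vertex then
64                         Delete the source vertex and flag its cell as an empty cell
65                        
66                  if there is no item on both the outgoing Doll and the incoming Doll of the destination vertex then
67                         Delete the destination vertex and flag its cell as an empty cell
68                        
Algorithm 1 Dolha snapshot edge processing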
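
The following Python sketch follows the steps of Algorithm 1 under simplifying assumptions and is not the authors' implementation. Python dicts stand in for the vertex and edge hash tables (so the explicit collision lists are not modeled), the tables grow on demand instead of having a fixed size, and a non-positive update to an unknown edge is ignored. The class name `Dolha`, its field names and the `NIL` marker are all hypothetical. What the sketch does keep is the index-based layout: every Doll link is an edge table index, and each vertex cell stores the head and tail of its two Dolls.

```python
NIL = -1  # marks "no item" in every index field


class Dolha:
    """Toy Dolha snapshot structure; Python dicts stand in for the two hash tables."""

    def __init__(self):
        self.vhash, self.ehash = {}, {}   # vertex ID -> vertex index, (u, v) -> edge index
        self.vtab, self.etab = [], []     # vertex table and edge table
        self.vfree, self.efree = [], []   # indexes of emptied cells, reused first

    def _alloc(self, table, free, cell):
        if free:
            i = free.pop()
            table[i] = cell
        else:
            i = len(table)
            table.append(cell)
        return i

    def _vertex(self, vid):
        if vid not in self.vhash:
            cell = {"id": vid, "out_w": 0, "in_w": 0,
                    "out": [NIL, NIL], "in": [NIL, NIL]}  # Doll [head, tail]
            self.vhash[vid] = self._alloc(self.vtab, self.vfree, cell)
        return self.vhash[vid]

    def _link(self, ei, side, vi):
        """Append edge ei to the end of one Doll ("out" or "in") of vertex vi."""
        ends = self.vtab[vi][side]
        self.etab[ei][side] = [ends[1], NIL]  # [previous, next]
        if ends[1] == NIL:
            ends[0] = ei
        else:
            self.etab[ends[1]][side][1] = ei
        ends[1] = ei

    def _unlink(self, ei, side, vi):
        """Remove edge ei from one Doll of vertex vi in constant time."""
        prev, nxt = self.etab[ei][side]
        ends = self.vtab[vi][side]
        if prev == NIL:
            ends[0] = nxt
        else:
            self.etab[prev][side][1] = nxt
        if nxt == NIL:
            ends[1] = prev
        else:
            self.etab[nxt][side][0] = prev

    def _drop_if_isolated(self, vi):
        v = self.vtab[vi]
        if v is not None and v["out"][0] == NIL and v["in"][0] == NIL:
            del self.vhash[v["id"]]
            self.vtab[vi] = None
            self.vfree.append(vi)

    def process(self, u, v, t, w):
        """Handle one incoming edge (u, v) with time stamp t and weight w."""
        ei = self.ehash.get((u, v))
        if ei is None:
            if w <= 0:
                return  # assumption: ignore non-positive updates to unknown edges
            ui, vi = self._vertex(u), self._vertex(v)
            e = {"u": ui, "v": vi, "w": w, "t": t,
                 "out": [NIL, NIL], "in": [NIL, NIL]}
            ei = self.ehash[(u, v)] = self._alloc(self.etab, self.efree, e)
        else:
            e = self.etab[ei]
            ui, vi = e["u"], e["v"]
            e["w"], e["t"] = e["w"] + w, t
            self._unlink(ei, "out", ui)
            self._unlink(ei, "in", vi)
        self.vtab[ui]["out_w"] += w
        self.vtab[vi]["in_w"] += w
        if e["w"] > 0:
            self._link(ei, "out", ui)
            self._link(ei, "in", vi)
        else:
            self.vtab[ui]["out_w"] -= e["w"]  # remove any leftover non-positive weight
            self.vtab[vi]["in_w"] -= e["w"]
            del self.ehash[(u, v)]
            self.etab[ei] = None
            self.efree.append(ei)
            self._drop_if_isolated(ui)
            self._drop_if_isolated(vi)

    def edge_query(self, u, v):
        ei = self.ehash.get((u, v))
        return None if ei is None else (self.etab[ei]["w"], self.etab[ei]["t"])

    def successors(self, u):
        if u not in self.vhash:
            return []
        found, ei = [], self.vtab[self.vhash[u]]["out"][0]
        while ei != NIL:
            found.append(self.vtab[self.etab[ei]["v"]]["id"])
            ei = self.etab[ei]["out"][1]
        return found
```

Because an updated edge is unlinked and then appended again, each Doll stays ordered by the time of the last update, so `successors` returns neighbors from least to most recently updated. Freed cell indexes are reused by later arrivals, which keeps the tables compact.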

For example, at time , edge is received. We use to get the edge hash table index and find edge is a new edge. We write the empty cell index of edge table into hash table and check the two vertices by using vertex hash table. We locate the for and for on vertex table and get the last item of outgoing Doll and the last item of incoming Doll . We update the both last items of outgoing and incoming Doll to then move to edge table. We update the next item of outgoing Doll to in and update the next item of incoming Doll to in . Finally, we write , , , and into .

At time , edge arrives and is already in the edge table. We first update the and at , remove from both Dolls, and then add it to the end of both Dolls.

At time , edge carries negative weight and is after the update. We remove from the outgoing and incoming Dolls and update the associated indexes, then we empty the cell of the edge table and put the index into the empty edge cell list. At time , edge is deleted and has no out or in edges. We empty cell of the vertex table and put the index into the empty vertex cell list.

Algorithm 1 shows how Dolha processes one incoming edge.

From line to , we maintain the edge hash table to check the existence of incoming edge . According to [?], if we hash n items into a hash table of size n, the expected maximum list length is $\Theta(\log n / \log\log n)$. In the experiment, more than of the collision lists are shorter than , and more than are shorter than . A hash table achieves $O(1)$ amortized time cost for item insertion, deletion and update, which is much faster than a sorted table. This step costs $O(1)$ time.
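The claim about short collision lists can be checked with a small simulation. The sketch below is not the paper's experiment: it assumes a uniformly random hash function (modelled with a seeded random number generator) and chained buckets, and the names `chain_lengths` and `fraction_shorter_than` are illustrative only.

```python
import random
from collections import Counter

def chain_lengths(n, seed=0):
    """Hash n items into n buckets with an idealised uniform hash.

    Returns a Counter mapping bucket index -> collision list length
    (only non-empty buckets appear).
    """
    rng = random.Random(seed)
    return Counter(rng.randrange(n) for _ in range(n))

def fraction_shorter_than(buckets, limit):
    """Fraction of non-empty collision lists whose length is below `limit`."""
    lengths = list(buckets.values())
    return sum(1 for length in lengths if length < limit) / len(lengths)

n = 100_000
buckets = chain_lengths(n)
longest = max(buckets.values())
print("longest collision list:", longest)
print("share of lists shorter than 4:", fraction_shorter_than(buckets, 4))
```

Under this idealised model the longest list stays far below n, which is consistent with treating each hash table operation as constant time on average.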

If is a new edge, from line to , we maintain the vertex hash table to check the existence of the two vertices and . In this step, we do two hash table lookups, which cost $O(1)$ time. From line to , we write into the edge table and then add it to the end of the outgoing and incoming Dolls. The time complexity of this step is the same as insertion into a doubly linked list, which is $O(1)$.

If exists, from line to , we update the weight and time stamp of and then delete it from the outgoing and incoming Dolls. This step costs the same time as deletion from a doubly linked list, which is also $O(1)$. From line to , if the updated weight is positive, we add to the end of both Dolls, which costs $O(1)$. If the updated weight is zero or negative, we delete completely and then delete its two endpoint vertices if they have no remaining incoming or outgoing edges. Line to shows the deletions, and this step also costs $O(1)$.

Overall, the time complexity of Dolha for processing each incoming edge is $O(1)$.
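The edge processing steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: Python dictionaries stand in for the vertex and edge hash tables with their collision lists, the per-vertex in and out weight totals are omitted, a new edge with non-positive weight is simply ignored, and all names (`DolhaSnapshot`, `process`, `successors`) are hypothetical.

```python
NULL = -1  # stand-in for the null index

class DolhaSnapshot:
    def __init__(self):
        self.vertex_index = {}   # vertex ID -> vertex table index
        self.edge_index = {}     # (src, dst) -> edge table index
        self.vertices, self.edges = [], []          # the two cell tables
        self.free_vertices, self.free_edges = [], []  # empty cell lists

    def _alloc(self, table, free, cell):
        """Store cell in a recycled empty cell if any, else at the end."""
        if free:
            i = free.pop()
            table[i] = cell
        else:
            i = len(table)
            table.append(cell)
        return i

    def _vertex(self, vid):
        i = self.vertex_index.get(vid)
        if i is None:
            # "out"/"in" hold [first, last] edge indexes of the two Dolls
            cell = {"id": vid, "out": [NULL, NULL], "in": [NULL, NULL]}
            i = self._alloc(self.vertices, self.free_vertices, cell)
            self.vertex_index[vid] = i
        return i

    def _append(self, e, side, v):
        """Append edge cell e to the end of the `side` Doll of vertex cell v."""
        ends = self.vertices[v][side]
        self.edges[e][side] = [ends[1], NULL]   # [previous, next]
        if ends[1] == NULL:
            ends[0] = e
        else:
            self.edges[ends[1]][side][1] = e
        ends[1] = e

    def _unlink(self, e, side, v):
        """Remove edge cell e from the `side` Doll of vertex cell v."""
        ends = self.vertices[v][side]
        prev, nxt = self.edges[e][side]
        if prev == NULL:
            ends[0] = nxt
        else:
            self.edges[prev][side][1] = nxt
        if nxt == NULL:
            ends[1] = prev
        else:
            self.edges[nxt][side][0] = prev

    def _drop_if_isolated(self, v):
        cell = self.vertices[v]
        if cell["out"][0] == NULL and cell["in"][0] == NULL:
            del self.vertex_index[cell["id"]]
            self.vertices[v] = None            # flag as empty cell
            self.free_vertices.append(v)

    def process(self, src, dst, weight, time):
        key = (src, dst)
        e = self.edge_index.get(key)
        if e is None:                          # new edge
            if weight <= 0:
                return                         # assumption: nothing to store
            s, d = self._vertex(src), self._vertex(dst)
            cell = {"src": s, "dst": d, "weight": weight, "time": time}
            e = self._alloc(self.edges, self.free_edges, cell)
            self.edge_index[key] = e
        else:                                  # existing edge: update, unlink
            cell = self.edges[e]
            s, d = cell["src"], cell["dst"]
            cell["weight"] += weight
            cell["time"] = time
            self._unlink(e, "out", s)
            self._unlink(e, "in", d)
            if cell["weight"] <= 0:            # delete edge, maybe vertices
                del self.edge_index[key]
                self.edges[e] = None
                self.free_edges.append(e)
                self._drop_if_isolated(s)
                if d != s:
                    self._drop_if_isolated(d)
                return
        self._append(e, "out", s)
        self._append(e, "in", d)

    def successors(self, vid):
        """Walk the outgoing Doll of vid, oldest update first."""
        v = self.vertex_index.get(vid)
        e = NULL if v is None else self.vertices[v]["out"][0]
        result = []
        while e != NULL:
            result.append(self.vertices[self.edges[e]["dst"]]["id"])
            e = self.edges[e]["out"][1]
        return result
```

Every branch of `process` touches a fixed number of cells, which mirrors the constant-time argument above; `successors` is linear in the out-degree of the queried vertex.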

Dolha snapshot structure needs one cells vertex hash table, one cells vertex table, one cells edge hash table and one cells edge table. Dolha also needs a bits integer for one vertex index and bits for one edge index.

Vertex hash table: Each cell only stores one vertex index. It costs space.

Edge hash table: Each cell only stores one edge index. It costs space.

Vertex table: Each cell stores the vertex ID, the in and out weights, one bits vertex index for the collision list, and four bits edge indexes for the Dolls. It costs space.

Edge table: Each cell stores the weight, the time stamp, one bits edge index for the collision list, two bits vertex indexes for the in and out vertices, and four bits edge indexes for the Dolls. It costs space.

In total, Dolha needs bits for the data structure. Since usually , the space cost of Dolha snapshot structure is .
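The per-table accounting above can be turned into a rough size estimate. The sketch below is only illustrative and rests on assumptions that are not taken from the paper: each table has exactly one cell per vertex or per edge, an index is as wide as needed to address its table, and the payload widths (`id_bits`, `weight_bits`, `time_bits`) are free parameters chosen by the reader.

```python
def index_bits(cells):
    """Bits needed to address `cells` table cells (at least 1)."""
    return max(1, (cells - 1).bit_length())

def snapshot_bits(num_vertices, num_edges,
                  id_bits=64, weight_bits=32, time_bits=32):
    """Estimated size in bits of the four snapshot tables."""
    v_idx = index_bits(num_vertices)
    e_idx = index_bits(num_edges)
    # Hash tables: one index per cell.
    vertex_hash = num_vertices * v_idx
    edge_hash = num_edges * e_idx
    # Vertex cell: ID, in/out weights, collision index, four Doll indexes.
    vertex_table = num_vertices * (id_bits + 2 * weight_bits
                                   + v_idx + 4 * e_idx)
    # Edge cell: weight, time stamp, collision index,
    # two vertex indexes, four Doll indexes.
    edge_table = num_edges * (weight_bits + time_bits
                              + e_idx + 2 * v_idx + 4 * e_idx)
    return vertex_hash + edge_hash + vertex_table + edge_table

# Example: one million vertices and one billion edges, reported in GiB.
print(snapshot_bits(10**6, 10**9) / 8 / 2**30)
```

Because every index field grows only logarithmically with the table size, the estimate is dominated by the edge table when there are more edges than vertices.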

Using Dolha, we can construct a persistent structure that contains the information of any snapshot of . has the same structure as except for the time travel list.

Definition 8 (Time Travel List)

An edge may appear in the streaming graph multiple times with different time stamps. The time travel list is a singly linked list that links all the edges which share the same outgoing and incoming vertices. In , each edge has an index that points to its previous appearance in the stream.

also has four index-value tables. The vertex hash table, vertex table and edge hash table are the same as in . In each cell of the edge table, has an extra value which indicates the previous item on the time travel list.

When an edge comes:

  • Check the existence of in the same way as Dolha snapshot.

If does not exist in :

  • The operation is exactly the same as in Dolha snapshot.

If exists in :

  • Use edge hash table to find the existing edge table index of .

  • Insert edge as a new edge into the edge table and set its time travel list index to .

  • Update the edge table index of on edge hash collision list.

Input: Streaming graph
Output: Dolha persistent structure of
1 for each incoming edge of  do
2       Check existence of :
3       if  does not exist then
4             Insert
5            
6      if  exists in cell  then
7             Insert as new edge and let
8             Let
9            
10      if value of in edge hash table is  then
11             Let
12            
13      else
14             Let
Algorithm 2 Dolha persistent edge processing
Hash index 0 1 2 3 4 5 6 7 8 9
Edge table Index
Table: Edge hash table of Window 0
Index 0 1 2 3 4 5 6 7 8 9
1 1 1 1 1 1 2 / / /
1 2 3 4 5 6 7 / / /
V
O
I
H
T
Table: Edge table of Window 0
Index 0 1 2 3 4 5 6 7 8 9
/ / / 1 1 1 1 -1 0 /
/ / / 4 5 6 7 9 10 /
V
O
I
H
T
Table: Edge table of Window 1

The two Window 0 tables show the edge hash table and edge table of Dolha persistent for in Window 0. The vertex hash table and vertex table of Dolha persistent are similar to those of Dolha snapshot, and so is the handling of a new incoming edge. For an edge update at time , however, we add the update as a new edge into and point the edge hash table to the latest update. By using the time travel list, all the updates of are linked.
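A minimal sketch of the time travel list, assuming a Python dictionary in place of the edge hash table and leaving out the Dolls and the vertex tables entirely. The class and method names are hypothetical, and `weight_at` additionally assumes that the weights of successive updates are additive.

```python
NULL = -1  # stand-in for the null index

class TimeTravelEdges:
    def __init__(self):
        self.latest = {}  # (src, dst) -> edge table index of the latest update
        self.table = []   # edge table, appended in time stamp order

    def insert(self, src, dst, weight, time):
        """Store the update as a new cell linked to its previous appearance."""
        key = (src, dst)
        prev = self.latest.get(key, NULL)
        self.table.append({"src": src, "dst": dst, "weight": weight,
                           "time": time, "prev": prev})
        self.latest[key] = len(self.table) - 1
        return self.latest[key]

    def history(self, src, dst):
        """Yield (time, weight) pairs of an edge, newest first."""
        i = self.latest.get((src, dst), NULL)
        while i != NULL:
            cell = self.table[i]
            yield cell["time"], cell["weight"]
            i = cell["prev"]

    def weight_at(self, src, dst, time):
        """Accumulated weight of the edge up to and including `time`."""
        return sum(w for t, w in self.history(src, dst) if t <= time)
```

Inserting an update only appends one cell and rewrites one hash entry, so it adds the cost of a singly linked list insertion to the snapshot operations.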

When the window slides to the th step, we have the start time and end time of the expired edges, which need to be deleted from the edge table. Since the edge table is naturally ordered by time, we can find the last expired edge, denoted as at , in time. Using the edge hash table, we can find the latest update of and traverse back along the time travel list. For each on the time travel list, let . If each , delete all the . Then delete each on the time travel list. Do the same for the edges from to . For every deleted edge, if it is the first or last item of a Doll, update the associated cell in the vertex table and set the index to . If all the Doll indexes in that vertex cell are , delete the vertex and flag the cell as empty.

As shown in the figure, when the window slides from to , the edges before expire. First, we binary search the edge table to locate the index of the first unexpired edge, since the table is sorted by time stamp. Then we start to delete the expired edges from cell . We use the hash table to check whether there are unexpired updates for the expired edges. For example, has an unexpired update at time , so we subtract the expired weight at cell .
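Locating the first unexpired edge is an ordinary binary search over the time stamps. The sketch below uses Python's standard `bisect` module on a plain list of time stamps; the function name and the example values are illustrative only, and an edge whose time stamp equals the window start is treated as unexpired.

```python
from bisect import bisect_left

def first_unexpired(timestamps, window_start):
    """Index of the first cell whose time stamp is >= window_start.

    `timestamps` must be sorted in ascending order, as in the
    chronologically ordered edge table.
    """
    return bisect_left(timestamps, window_start)

# Cells before the returned index hold expired edges.
times = [1, 2, 3, 4, 5, 6, 7]
cut = first_unexpired(times, 4)
expired, alive = times[:cut], times[cut:]
```

All cells before the returned index form one contiguous block, which is what makes the space recycling of expired cells straightforward.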

The Window 1 table shows the edge table of Dolha persistent at Window 1. The first 3 expired edges have been deleted. At time , with negative weight arrives, but there is no positive in this window. In this case, we do not save . At time and , has negative or zero weights, but has positive weight at time , so we keep the records and link them with the time travel list.

Space Recycle: Because the edge table is chronologically ordered, the expired edges are always contiguous and sit ahead of the unexpired edges. We can always recycle the space of expired edges, which means we do not need infinite space to store the continuous stream but only enough for the maximum number of edges in any window. For instance, in the example edge table, we can reuse the cells from to for the next window update, and we have enough space as long as there are no more than edges in the window.

The time cost of Dolha persistent consists of the hash table cost, the Doll cost and the time travel list cost. For each incoming edge, the hash table cost and the Doll cost are $O(1)$, as discussed for Dolha snapshot, and the time travel list cost is also $O(1)$, the same as insertion into a singly linked list. Overall, the time cost for processing one edge is $O(1)$.

To store all the information of streaming , Dolha persistent structure needs one cells vertex hash table, one cells vertex table, one cells edge hash table and one cells edge table. In total, Dolha needs bits plus for time travel list. The space cost of Dolha persistent structure is .

In this section, we discuss how to perform the graph algorithms on both Dolha snapshot structure and