ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs
Abstract
Single-source and top-$k$ SimRank queries are two important types of similarity search in graphs with numerous applications in web mining, social network analysis, spam detection, etc. A plethora of techniques have been proposed for these two types of queries, but very few can efficiently support similarity search over large dynamic graphs, due to either significant preprocessing time or large space overheads.
This paper presents ProbeSim, an index-free algorithm for single-source and top-$k$ SimRank queries that provides a non-trivial theoretical guarantee on the absolute error of query results. ProbeSim estimates SimRank similarities without precomputing any indexing structures, and thus can naturally support real-time SimRank queries on dynamic graphs. Besides the theoretical guarantee, ProbeSim also offers satisfactory practical efficiency and effectiveness due to non-trivial optimizations. We conduct extensive experiments on a number of benchmark datasets, which demonstrate that our solutions outperform the existing methods in terms of efficiency and effectiveness. Notably, our experiments include the first empirical study that evaluates the effectiveness of SimRank algorithms on billion-edge graphs, using the idea of pooling.
Authors: Yu Liu, Bolong Zheng, Xiaodong He, Zhewei Wei, Xiaokui Xiao, Kai Zheng, Jiaheng Lu. Proceedings of the VLDB Endowment, Volume 11, Number 1, 2017. DOI: https://doi.org/10.14778/3136610.3136612
1 Introduction
SimRank [10] is a classic measure of the similarities of graph nodes, and it has been adopted in numerous applications such as web mining [11], social network analysis [16], and spam detection [24]. The formulation of SimRank is based on two intuitive statements: (i) a node is most similar to itself, and (ii) two nodes are similar if their neighbors are similar. Specifically, given two nodes $u$ and $v$ in a graph $G$, the SimRank similarity of $u$ and $v$, denoted as $s(u,v)$, is defined as:
$$s(u,v) = \begin{cases} 1, & \text{if } u = v, \\ \dfrac{c}{|I(u)| \cdot |I(v)|} \displaystyle\sum_{x \in I(u)} \sum_{y \in I(v)} s(x,y), & \text{otherwise,} \end{cases} \tag{1}$$
where $I(u)$ denotes the set of in-neighbors of $u$, and $c \in (0,1)$ is a decay factor typically set to 0.6 or 0.8 [10, 18].
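As a concrete reading of the recursive definition, the following minimal Python sketch iterates the SimRank equation to a fixed number of rounds on a small edge list. The function name `simrank_power_method`, the edge-list input format, and the convention that a pair involving a node without in-neighbors has similarity 0 are assumptions made for illustration; the all-pairs table it builds is exactly what makes this approach impractical on large graphs.

```python
from itertools import product


def simrank_power_method(edges, c=0.6, iterations=10):
    """Iterate the SimRank recursion for all node pairs (illustrative only)."""
    nodes = sorted({x for edge in edges for x in edge})
    in_nbrs = {v: [] for v in nodes}
    for src, dst in edges:
        in_nbrs[dst].append(src)

    # Start from s(u, u) = 1 and s(u, v) = 0 for u != v.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
            elif in_nbrs[a] and in_nbrs[b]:
                total = sum(sim[(x, y)] for x in in_nbrs[a] for y in in_nbrs[b])
                new_sim[(a, b)] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
            else:
                # Assumed convention: no in-neighbors means similarity 0.
                new_sim[(a, b)] = 0.0
        sim = new_sim
    return sim


# Two nodes "a" and "b" that share the single in-neighbor "p".
example = simrank_power_method([("p", "a"), ("p", "b")], c=0.6)
```

In this toy graph the only in-neighbor pair of ("a", "b") is ("p", "p"), so the recursion gives a similarity equal to the decay factor.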
Computing SimRank efficiently is a non-trivial problem that has been studied extensively in the past decade. The early effort [10] focuses on computing the SimRank similarities of all pairs of nodes in $G$, but the proposed Power Method algorithm incurs prohibitive overheads when the number $n$ of nodes in $G$ is large, as there exist $O(n^2)$ node pairs in $G$. To avoid the inherent costs of all-pair SimRank computation, the majority of the subsequent work considers two types of SimRank queries instead:

Single-source SimRank query: given a query node $u$, return $s(u,v)$ for every node $v$ in $G$;

Top-$k$ SimRank query: given a query node $u$ and a parameter $k$, return the $k$ nodes $v$ with the largest $s(u,v)$.
Existing techniques [12, 19, 26, 6, 13, 23, 30, 15] for these two types of queries, however, suffer from two major deficiencies. First, most methods [19, 13, 15, 23, 30] fail to provide any worst-case guarantee in terms of the accuracy of query results, as they either rely on heuristics or adopt an incorrect formulation of SimRank. Second, the existing solutions [26, 6, 15] with theoretical accuracy guarantees all require constructing index structures on the input graphs in a preprocessing phase, which incurs significant space and precomputation overheads. The only exception is the Monte Carlo method in [6], which provides an index-free solution with theoretical accuracy guarantees. While pioneering, this method unfortunately entails considerable query overheads, as shown in [23].
Motivations. In this paper, we aim to develop an index-free solution for single-source and top-$k$ SimRank queries with provable accuracy guarantees. Our motivation for devising algorithms without preprocessing is twofold. First, index-based SimRank methods often have difficulties handling dynamic graphs. For example, SLING [26], which is the state-of-the-art indexing-based SimRank algorithm for static graphs, requires its index structure to be rebuilt from scratch whenever the input graph is updated, and its index construction requires several hours even on medium-size graphs with 1 million nodes. This renders it infeasible for real-time queries on dynamic graphs. In contrast, index-free techniques can naturally support real-time SimRank queries on graphs with frequent updates. To the best of our knowledge, the TSF method [23] is the only indexing approach that allows efficient updates. However, TSF is unable to provide any worst-case guarantees in terms of the accuracy of the SimRank estimations, which leads to unsatisfactory empirical effectiveness, as shown in [33] and in our experiments.
Our second motivation is that index-based SimRank methods often fail to scale to large graphs due to their space overheads. For example, TSF requires an index space that is two to three orders of magnitude larger than the input graph size, and our empirical study shows that it runs out of 64GB memory for graphs over 1GB in our experiments. This renders it applicable only to small and medium-size datasets. Further, moving the large index to external memory would incur expensive preprocessing and query costs, as shown in our empirical study. In contrast, the index-free solution proposed in this paper does not require any space beyond that of the original graph.
Our contributions. This paper presents an in-depth study on single-source and top-$k$ SimRank queries, and makes the following contributions. First, for single-source and top-$k$ SimRank queries, we propose an algorithm with provable approximation guarantees. In particular, given two constants $\varepsilon_a$ and $\delta$, our algorithm ensures that, with at least $1-\delta$ probability, each SimRank similarity returned has at most $\varepsilon_a$ absolute error. The algorithm runs in $O\left(\frac{n}{\varepsilon_a^2}\log\frac{n}{\delta}\right)$ expected time, and it does not require any index structure to be precomputed on the input graph. Our algorithm matches the state-of-the-art index-free solution in terms of time complexity, but it offers much higher practical efficiency due to an improved algorithm design and several non-trivial optimizations.
Our second contribution is a large set of experiments that evaluate the proposed solutions against the state of the art on benchmark datasets. Most notably, we present the first empirical study that evaluates the effectiveness of SimRank algorithms on billion-edge graphs, using the idea of pooling borrowed from the information retrieval community. The results demonstrate that our solutions significantly outperform the existing methods in terms of both efficiency and effectiveness. In addition, our solutions are more scalable than the state-of-the-art index-based techniques, in that they can handle large graphs on which the existing solutions require excessive space and time costs in preprocessing.
2 Preliminaries
2.1 Problem Definition
Table 1 shows the notations that are frequently used in the remainder of the paper. Let $G = (V, E)$ be a directed simple graph with $|V| = n$ and $|E| = m$. We aim to answer approximate single-source and top-$k$ SimRank queries, defined as follows:
Definition 1 (Approximate Single-Source Queries)
Given a node $u$ in $G$, an absolute error threshold $\varepsilon_a$, and a failure probability $\delta$, an approximate single-source SimRank query returns an estimated value $\tilde{s}(u,v)$ for each node $v$ in $G$, such that
$$|\tilde{s}(u,v) - s(u,v)| \le \varepsilon_a$$
holds for any $v$ with at least $1-\delta$ probability.
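Definition 2 (Approximate Top-$k$ Queries)
Given a node $u$ in $G$, a positive integer $k$, an error threshold $\varepsilon_a$, and a failure probability $\delta$, an approximate top-$k$ SimRank query returns a sequence of $k$ nodes $v_1, v_2, \ldots, v_k$ and an estimated value $\tilde{s}(u,v_i)$ for each $v_i$, such that the following inequalities hold with at least $1-\delta$ probability for any $i \in [1, k]$:
$$|\tilde{s}(u,v_i) - s(u,v_i)| \le \varepsilon_a, \qquad s(u,v_i) \ge s(u,v_i^{*}) - \varepsilon_a,$$
where $v_i^{*}$ is the node in $G$ whose SimRank similarity to $u$ is the $i$-th largest.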
Essentially, the approximate top-$k$ query for node $u$ returns $k$ nodes whose actual SimRank similarities with respect to $u$ are close to those of the actual top-$k$ nodes. It is easy to see that an approximate single-source algorithm can be extended to answer approximate top-$k$ queries, by sorting the SimRank estimations and outputting the top-$k$ results. Therefore, our main focus is on designing efficient and scalable algorithms that answer approximate single-source queries with an $\varepsilon_a$ guarantee.
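The reduction from top-$k$ to single-source queries can be sketched in a few lines of Python. The function name `top_k_from_single_source` and the dictionary input format are hypothetical, and excluding the query node itself from the result is an assumption rather than part of the definition above.

```python
import heapq


def top_k_from_single_source(estimates, query, k):
    """Pick the k nodes with the largest estimated SimRank to `query`.

    `estimates` maps each node to its estimated similarity; the query node
    is skipped here (an assumption made for this sketch).
    """
    candidates = ((v, s) for v, s in estimates.items() if v != query)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Example with a few similarity values relative to a query node "a".
scores = {"a": 1.0, "b": 0.0096, "c": 0.049, "d": 0.131, "e": 0.070}
top_two = top_k_from_single_source(scores, "a", 2)
```

Using `heapq.nlargest` avoids a full sort when $k$ is much smaller than the number of nodes.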
2.2 SimRank with Random Walks
In the seminal paper [10] that proposes SimRank, Jeh and Widom show that there is an interesting connection between SimRank similarities and random walks. In particular, let $u$ and $v$ be two nodes in $G$, and $W(u)$ (resp. $W(v)$) be a random walk from $u$ (resp. $v$) that follows a randomly selected incoming edge at each step. Let $t$ be the smallest positive integer such that the $t$-th nodes of $W(u)$ and $W(v)$ are identical. Then, we have
$$s(u,v) = \mathbb{E}\left[c^{\,t-1}\right], \tag{2}$$
where $c$ is the decay factor in the definition of SimRank (see Equation 1).
Subsequently, it is shown in [26] that Equation 2 can be simplified based on the concept of $\sqrt{c}$-walks, defined as follows.
Definition 3 ($\sqrt{c}$-walks)
Given a node $u$ in $G$, a $\sqrt{c}$-walk from $u$ is a random walk that follows the incoming edges of each node and stops at each step with $1-\sqrt{c}$ probability.
A $\sqrt{c}$-walk from $u$ can be generated as follows. Starting from $u$, when visiting node $v$, we generate a random number $r$ in $[0,1]$ and check whether $r \le 1-\sqrt{c}$. If so, we terminate the walk at $v$; otherwise, we select one of the in-neighbors of $v$ uniformly at random and proceed to that node.
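The generation procedure just described can be sketched in Python as follows. The function name `sqrt_c_walk`, the dictionary-of-lists representation of in-neighbors, and the choice to stop the walk at a node without in-neighbors are assumptions made for illustration.

```python
import math
import random


def sqrt_c_walk(start, in_nbrs, c=0.6, rng=None):
    """Generate one walk that stops at each visited node with prob. 1 - sqrt(c).

    `in_nbrs` maps a node to the list of its in-neighbors. A node with no
    in-neighbors ends the walk (an assumption; the text does not cover it).
    """
    rng = rng or random.Random()
    stop_prob = 1.0 - math.sqrt(c)
    walk = [start]
    while True:
        current = walk[-1]
        if rng.random() <= stop_prob:
            break  # terminate at the current node
        nbrs = in_nbrs.get(current, [])
        if not nbrs:
            break  # dead end: nothing to follow
        walk.append(rng.choice(nbrs))
    return walk


# Two nodes pointing at each other, so the walk never hits a dead end.
cycle = {"a": ["b"], "b": ["a"]}
sample = sqrt_c_walk("a", cycle, c=0.6, rng=random.Random(7))
```

On a graph without dead ends, the number of nodes on such a walk is geometrically distributed with mean $1/(1-\sqrt{c})$, which is about 4.4 for $c = 0.6$.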
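Let $W(u)$ and $W(v)$ be two $\sqrt{c}$-walks from two nodes $u$ and $v$, respectively. We say that the two walks meet if there exists a positive integer $i$ such that the $i$-th nodes of $W(u)$ and $W(v)$ are the same. Then, according to [26],
$$s(u,v) = \Pr\left[W(u) \text{ and } W(v) \text{ meet}\right]. \tag{3}$$
Based on Equation 3, one may estimate $s(u,v)$ using a Monte Carlo approach [6, 26] as follows. First, we generate $n_r$ pairs of $\sqrt{c}$-walks, such that the first and second walks in each pair are from $u$ and $v$, respectively. Let $n_{meet}$ be the number of walk pairs that meet. Then, we use $n_{meet}/n_r$ as an estimation of $s(u,v)$. By the Chernoff bound (see the full version of the paper), it can be shown that when $n_r$ is on the order of $\frac{1}{\varepsilon_a^2}\log\frac{1}{\delta}$, with at least $1-\delta$ probability we have $\left|n_{meet}/n_r - s(u,v)\right| \le \varepsilon_a$. In addition, the expected time required to generate $n_r$ pairs of $\sqrt{c}$-walks is $O(n_r)$, since each $\sqrt{c}$-walk has $\frac{1}{1-\sqrt{c}}$ nodes in expectation.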
The above Monte Carlo approach can also be straightforwardly adopted to answer any approximate single-source SimRank query from a node $u$. In particular, we can generate $n_r$ $\sqrt{c}$-walks from each node, and then use them to estimate $s(u,v)$ for every node $v$ in $G$. This approach is simple and does not require any precomputation, but it incurs considerable query overheads, since it requires generating a large number of $\sqrt{c}$-walks from each node.
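A minimal Python sketch of this Monte Carlo baseline for a single node pair is shown below. The names `sqrt_c_walk`, `walks_meet` and `monte_carlo_simrank` are hypothetical, the number of walk pairs is passed in directly rather than derived from an error bound, and walks are assumed to stop at nodes without in-neighbors.

```python
import math
import random


def sqrt_c_walk(start, in_nbrs, c, rng):
    """One walk that stops at each visited node with probability 1 - sqrt(c)."""
    walk = [start]
    while rng.random() > 1.0 - math.sqrt(c) and in_nbrs.get(walk[-1]):
        walk.append(rng.choice(in_nbrs[walk[-1]]))
    return walk


def walks_meet(walk_u, walk_v):
    """Two walks meet if their i-th nodes coincide for some position i."""
    return any(x == y for x, y in zip(walk_u, walk_v))


def monte_carlo_simrank(u, v, in_nbrs, c=0.6, n_r=10000, seed=0):
    """Estimate s(u, v) as the fraction of walk pairs that meet."""
    rng = random.Random(seed)
    n_meet = sum(
        walks_meet(sqrt_c_walk(u, in_nbrs, c, rng), sqrt_c_walk(v, in_nbrs, c, rng))
        for _ in range(n_r)
    )
    return n_meet / n_r


# "a" and "b" share the single in-neighbor "p", so s(a, b) equals c.
shared_parent = {"a": ["p"], "b": ["p"]}
estimate = monte_carlo_simrank("a", "b", shared_parent, c=0.64, n_r=20000, seed=3)
```

Answering a single-source query this way repeats the pair estimate for every node $v$, which is the overhead the paragraph above points out.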
2.3 Competitors
The TopSim-based algorithms. To address the drawbacks of the Monte Carlo approach, Lee et al. [13] propose TopSim-SM, an index-free algorithm that answers top-$k$ SimRank queries by enumerating all short random walks from the query node. More precisely, given a query node $u$ and a number $T$, TopSim-SM enumerates all the vertices that reach $u$ by at most $T$ hops, and treats them as potential meeting points. Then, TopSim-SM enumerates, for each meeting point $w$, the vertices that are reachable from $w$ within $T$ hops. Lee et al. [13] also propose two variants of TopSim-SM, named Trun-TopSim-SM and Prio-TopSim-SM, which trade accuracy for efficiency. In particular, Trun-TopSim-SM omits the meeting points with large degrees, while Prio-TopSim-SM prioritizes the meeting points in a more sophisticated manner and explores only the high-priority ones.
For each node $v$ returned, TopSim-SM provides an estimated SimRank that equals the SimRank value $s_T(u,v)$ approximated using the Power Method [10] with $T$ iterations. When $T$ is sufficiently large, $s_T(u,v)$ can be an accurate approximation of $s(u,v)$. However, Lee et al. [13] show that the query complexity of TopSim-SM is $O(d^{2T})$ time, where $d$ is the average in-degree of the graph. As a consequence, Lee et al. [13] suggest setting $T = 3$ to achieve reasonable efficiency, in which case the absolute error in each SimRank score can be as large as $c^{T}$, where $c$ is the decay factor in the definition of SimRank (Equation 1). Meanwhile, Trun-TopSim-SM and Prio-TopSim-SM do not provide any approximation guarantees even if $T$ is set to a large value, due to the heuristics that they apply to reduce the number of meeting points explored.
The TSF algorithm. Very recently, Shao et al. [23] propose a two-stage random-walk sampling framework (TSF) for top-$k$ SimRank queries on dynamic graphs. Given a parameter $R_g$, TSF starts by building $R_g$ one-way graphs as an index structure. Each one-way graph is constructed by uniformly sampling one in-neighbor from each vertex's incoming edges. The one-way graphs are then used to simulate random walks during query processing.
To achieve high efficiency, TSF approximates the SimRank score of two nodes $u$ and $v$ as
$$\tilde{s}(u,v) = \sum_{i \ge 1} c^{\,i} \cdot \Pr\left[\text{two random walks from } u \text{ and } v \text{ meet at the } i\text{-th step}\right],$$
which is an overestimation of the actual SimRank, since it counts every meeting rather than only the first one. (See Section 3.3 in [23].) Furthermore, TSF assumes that every random walk in a one-way graph does not contain any cycle, which does not always hold in practice, especially for undirected graphs. (See Section 3.2 in [23].) As a consequence, the SimRank value returned by TSF does not come with any theoretical error guarantee.
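Table 1: Frequently used notations.

| Notation | Description |
| --- | --- |
| $G$ | the input graph |
| $n, m$ | the numbers of nodes and edges in $G$ |
| $I(v)$ | the set of in-neighbors of a node $v$ in $G$ |
| $s(u,v)$ | the SimRank similarity of two nodes $u$ and $v$ in $G$ |
| $\tilde{s}(u,v)$ | an estimation of $s(u,v)$ |
| $W(u)$ | a $\sqrt{c}$-walk from a node $u$ |
| $c$ | the decay factor in the definition of SimRank |
| $\varepsilon_a$ | the maximum absolute error allowed in SimRank computation |
| $\delta$ | the failure probability of a Monte Carlo algorithm |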
3 ProbeSim Algorithm
In this section, we present ProbeSim, an index-free algorithm for approximate single-source and top-$k$ SimRank queries on large graphs. Recall that an approximate single-source algorithm can be extended to answer approximate top-$k$ queries, by sorting the SimRank estimations and outputting the top-$k$ results. Therefore, the ProbeSim algorithm described in this section focuses on approximate single-source queries with an $\varepsilon_a$ guarantee. Before diving into the details of the algorithm, we first give its high-level idea.
3.1 Rationale
Let $W(u)$ and $W(v)$ be two $\sqrt{c}$-walks from two nodes $u$ and $v$, respectively. Let $u_i$ be the $i$-th node in $W(u)$. (Note that $u_1 = u$.) By Equation 3, for $v \neq u$,
$$s(u,v) = \sum_{i=2}^{\infty} \Pr\left[W(v) \text{ and } W(u) \text{ first meet at } u_i\right]. \tag{4}$$
In other words, for a given $u_i$, if we can estimate the probability that a $\sqrt{c}$-walk from $v$ first meets $W(u)$ at $u_i$, then we can take the sum of the estimated probabilities over all $u_i$ as an estimation of $s(u,v)$. Towards this end, a naive approach is to generate a large number of $\sqrt{c}$-walks from $v$, and then check the fraction of walks that first meet $W(u)$ at $u_i$. However, if $s(u,v)$ is small, then most of the $\sqrt{c}$-walks are “wasted” since they would not meet $W(u)$. To address this issue, our idea is as follows: instead of sampling $\sqrt{c}$-walks from each $v$ to see if they can “hit” any $u_i$, we start a graph traversal from each $u_i$ to identify any node $v$ that has a non-negligible probability to “walk” to $u_i$. Intuitively, this significantly reduces the computation cost since it may enable us to omit the nodes whose SimRank similarities to $u$ are smaller than a given threshold $\varepsilon_a$.
In what follows, we first explain the details of the traversal-based algorithm mentioned above, and then analyze its approximation guarantee and time complexity. For convenience, we formalize the concept of first-meeting probability as follows.
Definition 4 (first-meeting probability)
Given a reverse path $\mathcal{P} = (u_1, \ldots, u_i)$ and a node $v \in V$, $v \neq u_1$, the first-meeting probability of $v$ with respect to $\mathcal{P}$ is defined to be
$$P(v, \mathcal{P}) = \Pr_{W(v)}\left[v_i = u_i,\ v_{i-1} \neq u_{i-1},\ \ldots,\ v_1 \neq u_1\right],$$
where $W(v) = (v_1, v_2, \ldots)$ is a random $\sqrt{c}$-walk that starts at $v_1 = v$.
Here, the subscript in $\Pr_{W(v)}$ indicates that the randomness arises from the choices of the $\sqrt{c}$-walk $W(v)$. In the remainder of the paper, we will omit this subscript when the context is clear.
3.2 Basic algorithm
We now describe the basic ProbeSim algorithm. Given a node $u$, a sampling error parameter $\varepsilon$ and a failure probability $\delta$, the algorithm returns $S = \{(v, \tilde{s}(u,v)) \mid v \in V\}$, a hash_set of the $n$ nodes in $V$ and their SimRank estimations. For every node $v \in V$, $v \neq u$, Algorithm 1 returns an estimated SimRank $\tilde{s}(u,v)$ of the actual SimRank $s(u,v)$ with guarantee $|\tilde{s}(u,v) - s(u,v)| \le \varepsilon$. Note that the basic algorithm uses unbiased sampling to produce the estimators, thus we can set $\varepsilon = \varepsilon_a$.
The pseudocode for the basic ProbeSim algorithm is illustrated in Algorithm 1. The algorithm runs $n_r$ independent trials (Line 1). For the $k$-th trial, the algorithm generates a $\sqrt{c}$-walk $W_k(u) = (u = u_1, u_2, \ldots, u_\ell)$ (Lines 2-3), and invokes the PROBE algorithm on partial walk $W(u,i) = (u_1, \ldots, u_i)$ for $i = 2, \ldots, \ell$ (Lines 5-6). The PROBE algorithm computes a $Score(v)$ for each node $v \in V$. As we shall see later, $Score(v)$ is equal to $P(v, W(u,i))$, the first-meeting probability of $v$ with respect to partial walk $W(u,i)$. Let $Score_i(v)$ denote the score computed by the PROBE algorithm on partial walk $W(u,i)$, for $i = 2, \ldots, \ell$. The algorithm sums up all scores to form the estimator $\tilde{s}_k(u,v) = \sum_{i=2}^{\ell} Score_i(v)$ (Lines 7-11).
Finally, for each node $v \in V$, we take the average over the $n_r$ independent estimators to form the final estimator $\tilde{s}(u,v) = \frac{1}{n_r}\sum_{k=1}^{n_r} \tilde{s}_k(u,v)$. Note that if we take the average after all trials finish, it would require $O(n \cdot n_r)$ space to store all the $\tilde{s}_k(u,v)$ values. Thus, we dynamically update the average estimator after each trial (Lines 12-16). After all $n_r$ trials finish, we return $S$ as the SimRank estimators for each node $v \in V$ (Line 17).
Deterministic PROBE Algorithm. We now give a simple deterministic PROBE algorithm for computing the scores in Algorithm 1. Given a partial $\sqrt{c}$-walk $W(u,i) = (u_1, \ldots, u_i)$ that starts at $u_1 = u$, the PROBE algorithm outputs $Score = \{(v, Score(v)) \mid v \in V\}$, a hash_set of nodes and their first-meeting probabilities with respect to reverse path $W(u,i)$.
The pseudocode for the algorithm is shown in Algorithm 2. The algorithm initializes $i$ hash tables $H_0, \ldots, H_{i-1}$ (Line 1) and adds $(u_i, 1)$ to $H_0$ (Line 2). In the $j$-th iteration, for each node $x$ in $H_j$, the algorithm finds each out-neighbour $v$ of $x$ with $v \neq u_{i-j-1}$, and checks if $v \in H_{j+1}$ (Lines 3-5). If so, the algorithm adds $\frac{\sqrt{c}}{|I(v)|} \cdot Score(x)$ to $Score(v)$ (Lines 6-7). Otherwise, it adds $\left(v, \frac{\sqrt{c}}{|I(v)|} \cdot Score(x)\right)$ to $H_{j+1}$ (Lines 8-9). We note that in this iteration, no score is added to $u_{i-j-1}$, which ensures that the walk avoids $u_{i-j-1}$ at its $(i-j-1)$-th node (Line 5).
The intuition of the PROBE algorithm is as follows. For ease of presentation, we let $Score_j(v)$ denote the score computed in the $j$-th iteration for node $v \in H_j$. One can show that $Score_j(v)$ is in fact equal to $P(v, (u_{i-j}, \ldots, u_i))$, the first-meeting probability of each node $v$ with respect to reverse path $(u_{i-j}, \ldots, u_i)$. Consequently, after the $(i-1)$-th iteration, the algorithm computes $Score_{i-1}(v) = P(v, W(u,i))$, the first-meeting probability of each node $v$ with respect to reverse path $W(u,i)$. We will make this argument rigorous in the analysis.
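The interplay of the trial loop and the deterministic probe can be illustrated with the following Python sketch. It is a simplified reading of the description above rather than the authors' pseudocode: the names `build_adjacency`, `sqrt_c_walk`, `probe` and `probesim` are hypothetical, the number of trials `n_r` is passed in directly instead of being derived from the error parameters, and the running average is replaced by accumulating `score / n_r`, which yields the same final value.

```python
import math
import random
from collections import defaultdict


def build_adjacency(edges):
    """Build in- and out-neighbor lists from (src, dst) pairs."""
    in_nbrs, out_nbrs = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        out_nbrs[src].append(dst)
        in_nbrs[dst].append(src)
    return in_nbrs, out_nbrs


def sqrt_c_walk(u, in_nbrs, sqrt_c, rng):
    """Follow random in-edges; continue each step with probability sqrt(c)."""
    walk = [u]
    while rng.random() < sqrt_c and in_nbrs.get(walk[-1]):
        walk.append(rng.choice(in_nbrs[walk[-1]]))
    return walk


def probe(partial_walk, in_nbrs, out_nbrs, sqrt_c):
    """First-meeting scores of all nodes w.r.t. partial_walk = (u_1, ..., u_i)."""
    i = len(partial_walk)
    level = {partial_walk[-1]: 1.0}        # H_0 holds (u_i, 1)
    for j in range(i - 1):                 # iterations j = 0, ..., i-2
        avoid = partial_walk[i - j - 2]    # u_{i-j-1} in 1-based indexing
        next_level = defaultdict(float)    # H_{j+1}
        for x, score in level.items():
            for v in out_nbrs.get(x, ()):
                if v != avoid:
                    next_level[v] += sqrt_c / len(in_nbrs[v]) * score
        level = next_level
    return dict(level)


def probesim(u, edges, c=0.6, n_r=1000, seed=0):
    """Average, over n_r trials, the summed probe scores along one walk."""
    in_nbrs, out_nbrs = build_adjacency(edges)
    sqrt_c, rng = math.sqrt(c), random.Random(seed)
    estimate = defaultdict(float)
    for _ in range(n_r):
        walk = sqrt_c_walk(u, in_nbrs, sqrt_c, rng)
        for i in range(2, len(walk) + 1):  # partial walks W(u, 2), ..., W(u, l)
            for v, score in probe(walk[:i], in_nbrs, out_nbrs, sqrt_c).items():
                estimate[v] += score / n_r
    return dict(estimate)


# "a" and "b" share the single in-neighbor "p", so s(a, b) equals c.
estimates = probesim("a", [("p", "a"), ("p", "b")], c=0.64, n_r=20000, seed=1)
```

In this toy graph each trial contributes either $\sqrt{c}$ (the walk reaches "p") or nothing, so the average approaches $c$; only walks from the query node are sampled, never from the other nodes.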
Running Example for Algorithms 1 and 2. Throughout the paper, we will use the toy graph in Figure 2 to illustrate our algorithms and pruning rules. For ease of presentation, we set the decay factor $c = 0.25$ so that $\sqrt{c} = 0.5$. The SimRank values of each node with respect to $a$ are listed in Table 2; they are computed by the Power Method to within a small error.
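Table 2: SimRank similarities with respect to node $a$.

| a | b | c | d | e | f | g | h |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1.0 | 0.0096 | 0.049 | 0.131 | 0.070 | 0.041 | 0.051 | 0.051 |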
Suppose at the $k$-th trial, a random walk is generated. Figure 2 illustrates the traversal process of the deterministic PROBE algorithm. For simplicity, we only demonstrate the traversal process for partial walk , which is represented by the rightmost tree in Figure 2. The algorithm first inserts to . Following the out-edges of , the algorithm finds and omits it as . Next, the algorithm finds , computes , and inserts to . Similarly, the algorithm finds and , and inserts and to . For the next iteration, we find , , and from the out-neighbours of , and . Note that is omitted due to the fact that . The score of at this iteration is computed by
Similarly, the algorithm computes , and , and inserts , , and to . Finally, for the last iteration, the algorithm computes and , and returns as the results.
To get an estimation from walk , Algorithm 1 will invoke PROBE for , and . Each probe gives a score set: , and . As an example, the estimator is computed by summing up all scores for , which equals . By summing up all scores from the different nodes in , the returned estimations of the SimRank scores are and .
3.3 Analysis
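Time Complexity. We notice that in each iteration of the PROBE algorithm, each edge in the graph is traversed at most once. Thus the time complexity of the PROBE algorithm is $O(m \cdot i)$, where $i$ is the length of the partial walk $W(u,i)$. Consequently, the time complexity of probing a single walk in Algorithm 1 is bounded by $O\left(\sum_{i=2}^{\ell} m \cdot i\right) = O(m \ell^2)$, where $\ell$ is the length of the $\sqrt{c}$-walk $W(u)$. We notice that each step in the $\sqrt{c}$-walk terminates with probability at least $1-\sqrt{c}$, so $\ell$ is bounded by a geometrically distributed random variable $X$ with success probability $p = 1-\sqrt{c}$ (here “success” means the termination of the walk). It follows that
$$\mathbb{E}[\ell^2] \le \mathbb{E}[X^2] = \mathrm{Var}[X] + \mathbb{E}[X]^2 = \frac{\sqrt{c}}{(1-\sqrt{c})^2} + \frac{1}{(1-\sqrt{c})^2} = \frac{1+\sqrt{c}}{(1-\sqrt{c})^2} = O(1).$$
Therefore, the expected running time of Algorithm 1 on a single walk is $O(m)$. Summing up over the $n_r$ walks, the expected running time of Algorithm 1 is bounded by $O(m \cdot n_r) = O\left(\frac{m}{\varepsilon^2}\log\frac{n}{\delta}\right)$.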
Correctness. We now show that Algorithm 1 indeed gives a good estimation of the SimRank values $s(u,v)$ for each $v \in V$, $v \neq u$. The following lemma states that each trial in Algorithm 1 gives an unbiased estimator for the SimRank value $s(u,v)$.
Lemma 1
For any $v \in V$ and $v \neq u$, Algorithm 1 gives an estimator $\tilde{s}(u,v)$ such that $\mathbb{E}[\tilde{s}(u,v)] = s(u,v)$.
We need the following lemma, which states that if we start a $\sqrt{c}$-walk $W(v)$ from $v$, then the score $Score(v)$ computed by the PROBE algorithm on partial walk $W(u,i)$ is exactly the probability that $W(v)$ and $W(u)$ first meet at $u_i$.
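Lemma 2
For any node $v \in V$, $v \neq u$, after the $(i-1)$-th iteration, $Score(v)$ is equal to $P(v, W(u,i))$, the first-meeting probability of $v$ with respect to partial walk $W(u,i)$.
Proof. We prove the following claim: Let $Score_j(v)$ denote the score of $v$ after the $j$-th iteration. Fix a node $v \in V$, $v \neq u_{i-j}$. After the $j$-th iteration in Algorithm 2, we have $Score_j(v) = P(v, (u_{i-j}, \ldots, u_i))$, the first-meeting probability of $v$ with respect to reverse path $(u_{i-j}, \ldots, u_i)$. Recall that
$$P(v, (u_{i-j}, \ldots, u_i)) = \Pr\left[v_{j+1} = u_i,\ v_j \neq u_{i-1},\ \ldots,\ v_1 \neq u_{i-j}\right],$$
where $W(v) = (v_1, v_2, \ldots)$ is a random $\sqrt{c}$-walk that starts at $v_1 = v$.
Note that if the above claim is true, then after the $(i-1)$-th iteration, we have $Score_{i-1}(v) = P(v, W(u,i))$, and the lemma follows. We prove the claim by induction. After the $0$-th iteration, we have $Score_0(u_i) = 1$ and $Score_0(v) = 0$ for $v \neq u_i$, so the claim holds. Assume the claim holds for the $j$-th iteration. After the $(j+1)$-th iteration, for each $v \in V$, $v \neq u_{i-j-1}$, the algorithm sets $Score_{j+1}(v)$ by the equation
$$Score_{j+1}(v) = \sum_{x \in I(v) \cap H_j} \frac{\sqrt{c}}{|I(v)|} \cdot Score_j(x). \tag{5}$$
By the induction hypothesis we have $Score_j(x) = P(x, (u_{i-j}, \ldots, u_i))$, and thus
$$Score_{j+1}(v) = \sum_{x \in I(v) \cap H_j} \frac{\sqrt{c}}{|I(v)|} \cdot P(x, (u_{i-j}, \ldots, u_i)). \tag{6}$$
Here $\frac{\sqrt{c}}{|I(v)|}$ denotes the probability that $W(v)$ selects $x$ at the first step. On the other hand, $P(v, (u_{i-j-1}, \ldots, u_i))$ can be expressed as the sum of the probabilities that $W(v)$ first selects an in-neighbor $x$ that is allowed at this position (that is, $x \neq u_{i-j}$ when $j \ge 1$, which is exactly the condition under which $x$ can appear in $H_j$), and then follows a reverse walk from $x$ to $u_i$ of length $j$ that avoids $u_{i-j+k}$ at the $k$-th step. It follows that
$$P(v, (u_{i-j-1}, \ldots, u_i)) = \sum_{x \in I(v) \cap H_j} \frac{\sqrt{c}}{|I(v)|} \cdot P(x, (u_{i-j}, \ldots, u_i)) = Score_{j+1}(v), \tag{7}$$
which completes the induction.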
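With the help of Lemma 2, we can prove Lemma 1.
Proof of Lemma 1. Let $\mathcal{W}(u)$ denote the set of all possible $\sqrt{c}$-walks that start at $u$. Fix a walk $W(u) = (u_1, \ldots, u_\ell) \in \mathcal{W}(u)$. By Lemma 2, the estimated SimRank can be expressed as $\tilde{s}(u,v) = \sum_{i=2}^{\ell} P(v, W(u,i))$. Thus we can compute the expectation of this estimation by
$$\mathbb{E}[\tilde{s}(u,v)] = \sum_{W(u) \in \mathcal{W}(u)} \Pr[W(u)] \cdot \sum_{i=2}^{\ell} P(v, W(u,i)), \tag{8}$$
where $\Pr[W(u)]$ is the probability of walk $W(u)$. Recall that $P(v, W(u,i))$ is the probability that a random $\sqrt{c}$-walk $W(v)$ first meets $W(u)$ at $u_i$. For ease of presentation, let $\mathbb{1}_i(W(u), W(v))$ denote the indicator variable that the two walks $W(u)$ and $W(v)$ first meet at $u_i$. In other words, $\mathbb{1}_i(W(u), W(v)) = 1$ if $W(u)$ and $W(v)$ first meet at $u_i$, and $\mathbb{1}_i(W(u), W(v)) = 0$ otherwise. We have
$$P(v, W(u,i)) = \sum_{W(v) \in \mathcal{W}(v)} \Pr[W(v)] \cdot \mathbb{1}_i(W(u), W(v)). \tag{9}$$
Combining Equations (8) and (9), it follows that
$$\mathbb{E}[\tilde{s}(u,v)] = \sum_{W(u) \in \mathcal{W}(u)} \sum_{W(v) \in \mathcal{W}(v)} \Pr[W(u)] \cdot \Pr[W(v)] \cdot \sum_{i=2}^{\ell} \mathbb{1}_i(W(u), W(v)).$$
Note that the right-hand side is the probability that $W(u)$ and $W(v)$ meet, which equals $s(u,v)$ by Equation 3, and the lemma follows.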
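By Lemma 1 and the Chernoff bound, we have the following theorem, which states that by performing $n_r$ independent trials, the error of the estimator provided by Algorithm 1 can be bounded with high probability.
Theorem 1
For every node $v \in V$, $v \neq u$, Algorithm 1 returns an estimation $\tilde{s}(u,v)$ for $s(u,v)$ such that
$$\Pr\left[\forall v \in V,\ |\tilde{s}(u,v) - s(u,v)| \le \varepsilon\right] \ge 1 - \delta.$$
We need the following form of the Chernoff bound:
Lemma 3 (Chernoff Bound [4])
For any set $\{x_i\}$ ($i \in [1, n_r]$) of i.i.d. random variables with mean $\mu$ and $x_i \in [0,1]$,
$$\Pr\left[\left|\frac{1}{n_r}\sum_{i=1}^{n_r} x_i - \mu\right| \ge \varepsilon\right] \le \exp\left(-\frac{n_r \cdot \varepsilon^2}{\frac{2}{3}\varepsilon + 2\mu}\right).$$
Proof of Theorem 1. We first note that in each trial $k$, the estimator $\tilde{s}_k(u,v)$ is a value in $[0,1]$. It is obvious that $\tilde{s}_k(u,v) \ge 0$. To see that $\tilde{s}_k(u,v) \le 1$, notice that $\tilde{s}_k(u,v)$ is a probability. More precisely, it is the probability that a $\sqrt{c}$-walk $W(v)$ meets the walk $W_k(u)$ using the same number of steps.
Thus, the final estimator $\tilde{s}(u,v)$ is the average of $n_r$ i.i.d. random variables whose values lie in the range $[0,1]$, and we can apply the Chernoff bound:
$$\Pr\left[|\tilde{s}(u,v) - s(u,v)| \ge \varepsilon\right] \le \exp\left(-\frac{n_r \cdot \varepsilon^2}{\frac{2}{3}\varepsilon + 2 s(u,v)}\right).$$
Recall that $n_r = \frac{3c}{\varepsilon^2}\log\frac{n}{\delta}$, and notice that $s(u,v) \le c$ for $v \neq u$. Assuming $\varepsilon \le \frac{3}{2}c$, so that $\frac{2}{3}\varepsilon + 2c \le 3c$, it follows that
$$\Pr\left[|\tilde{s}(u,v) - s(u,v)| \ge \varepsilon\right] \le \exp\left(-\frac{n_r \cdot \varepsilon^2}{3c}\right) = \frac{\delta}{n}.$$
Taking the union bound over all nodes $v \in V$, it follows that
$$\Pr\left[\exists v \in V,\ |\tilde{s}(u,v) - s(u,v)| \ge \varepsilon\right] \le n \cdot \frac{\delta}{n} = \delta,$$
and the theorem follows.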
4 Optimizations
We present three different optimization techniques to speed up our basic ProbeSim algorithm. The pruning rules eliminate unnecessary traversals in the PROBE algorithm, so that a single trial can be performed more efficiently. The batch algorithm builds a reachability tree to maintain all $\sqrt{c}$-walks, so that we do not have to perform duplicated PROBE operations in multiple trials. The randomized PROBE algorithm reduces the worst-case time complexity of our algorithm to $O\left(\frac{n}{\varepsilon_a^2}\log\frac{n}{\delta}\right)$ in expectation.
4.1 Pruning
Although the expected number of nodes on a $\sqrt{c}$-walk is $\frac{1}{1-\sqrt{c}}$, we may still encounter some long walks over a large number of trials. To avoid this overhead, we add the following pruning rule:
Pruning Rule 1
Let $\ell_t$ be the termination parameter to be determined later. In Algorithm 1, truncate all $\sqrt{c}$-walks at step $\ell_t$.
We explain the intuition of this pruning rule as follows. Let $W(u) = (u_1, \ldots, u_\ell)$ denote the $\sqrt{c}$-walk, and $u_i$ denote a node on the walk with $i > \ell_t$. For each node $v \in V$, $v \neq u$, the probability that a $\sqrt{c}$-walk from $v$ meets $W(u)$ at $u_i$ is at most $(\sqrt{c})^{i-1}$, since the walk from $v$ must survive $i-1$ steps; this means that $u_i$ contributes at most $(\sqrt{c})^{i-1}$ to the estimate $\tilde{s}(u,v)$. Summing over $i > \ell_t$ results in an error of at most $\frac{(\sqrt{c})^{\ell_t}}{1-\sqrt{c}}$. As we shall see in Theorem 2, a more elaborate analysis shows that the error contributed by this pruning rule admits a tighter bound. We further notice that the error is one-sided, so we can add half of the error bound to each estimator, which reduces the pruning error by a factor of 2.
The next pruning rule is inspired by the fact that the PROBE algorithm may traverse many vertices with small scores, which can be ignored for the estimation:
Pruning Rule 2
Let $\varepsilon_p$ denote the pruning parameter to be determined later. In Algorithm 2, after computing all $Score(x)$ in $H_j$ and before descending to $H_{j+1}$, we remove $x$ from $H_j$ if $Score(x) \cdot (\sqrt{c})^{i-j-1} \le \varepsilon_p$.
The intuition of Pruning Rule 2 is that after $i-j-1$ more iterations in Algorithm 2, the scores computed from $Score(x)$ will drop to at most $Score(x) \cdot (\sqrt{c})^{i-j-1} \le \varepsilon_p$. One might think that a node $v$ may get multiple error contributions from different pruned nodes; however, the key insight is that the probabilities of the $\sqrt{c}$-walks from $v$ to these nodes sum up to at most $1$, which implies that the error introduced by a single probe is at most $\varepsilon_p$. We will make this argument rigorous in Theorem 2.
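Both pruning rules can be sketched as small changes to a probe routine. The Python below is an illustration under stated assumptions: the names `truncate_walk` and `probe_with_pruning` are hypothetical, truncation is read as keeping the first `l_t` nodes of the walk, and the pruning test compares `score * sqrt(c) ** remaining` against `eps_p`, where `remaining` is the number of iterations left before the last level.

```python
import math
from collections import defaultdict


def truncate_walk(walk, l_t):
    """Pruning Rule 1 (as read here): keep at most the first l_t nodes."""
    return walk[:l_t]


def probe_with_pruning(partial_walk, in_nbrs, out_nbrs, c, eps_p):
    """Deterministic probe that drops low-score nodes before each descent."""
    sqrt_c = math.sqrt(c)
    i = len(partial_walk)
    level = {partial_walk[-1]: 1.0}          # H_0 holds (u_i, 1)
    for j in range(i - 1):
        remaining = i - j - 1                # iterations left until H_{i-1}
        # Pruning Rule 2: a score that can be at most eps_p at the end is dropped.
        level = {x: s for x, s in level.items()
                 if s * sqrt_c ** remaining > eps_p}
        avoid = partial_walk[i - j - 2]      # u_{i-j-1} in 1-based indexing
        next_level = defaultdict(float)
        for x, score in level.items():
            for v in out_nbrs.get(x, ()):
                if v != avoid:
                    next_level[v] += sqrt_c / len(in_nbrs[v]) * score
        level = dict(next_level)
    return level


# Toy graph: "p" points to both "a" and "b".
in_nbrs = {"a": ["p"], "b": ["p"]}
out_nbrs = {"p": ["a", "b"]}
kept = probe_with_pruning(["a", "p"], in_nbrs, out_nbrs, c=0.64, eps_p=0.0)
dropped = probe_with_pruning(["a", "p"], in_nbrs, out_nbrs, c=0.64, eps_p=0.9)
```

With `eps_p = 0` the routine behaves like the unpruned probe, while a threshold above the largest reachable score removes the whole traversal; the trade-off between the two is what the choice of the pruning parameter controls.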
Running Example for the Pruning Rules. Consider