# ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

## Abstract

Single-source and top-$k$ SimRank queries are two important types of similarity search in graphs with numerous applications in web mining, social network analysis, spam detection, etc. A plethora of techniques have been proposed for these two types of queries, but very few can efficiently support similarity search over large dynamic graphs, due to either significant preprocessing time or large space overheads.

This paper presents ProbeSim, an index-free algorithm for single-source and top-$k$ SimRank queries that provides a non-trivial theoretical guarantee on the absolute error of query results. ProbeSim estimates SimRank similarities without precomputing any indexing structures, and thus can naturally support real-time SimRank queries on dynamic graphs. Besides the theoretical guarantee, ProbeSim also offers satisfying practical efficiency and effectiveness due to non-trivial optimizations. We conduct extensive experiments on a number of benchmark datasets, which demonstrate that our solutions outperform the existing methods in terms of efficiency and effectiveness. Notably, our experiments include the first empirical study that evaluates the effectiveness of SimRank algorithms on billion-edge graphs, using the idea of pooling.

Yu Liu, Bolong Zheng, Xiaodong He, Zhewei Wei, Xiaokui Xiao, Kai Zheng, Jiaheng Lu

PVLDB, Volume 11, Number 1, 2017

## 1 Introduction

SimRank [10] is a classic measure of the similarities of graph nodes, and it has been adopted in numerous applications such as web mining [11], social network analysis [16], and spam detection [24]. The formulation of SimRank is based on two intuitive statements: (i) a node is most similar to itself, and (ii) two nodes are similar if their neighbors are similar. Specifically, given two nodes $u$ and $v$ in a graph $G$, the SimRank similarity of $u$ and $v$, denoted as $s(u,v)$, is defined as:

$$s(u,v)=\begin{cases}1, & \text{if } u=v,\\[4pt] \dfrac{c}{|I(u)|\cdot|I(v)|}\sum\limits_{x\in I(u),\,y\in I(v)} s(x,y), & \text{otherwise,}\end{cases} \tag{1}$$

where $I(u)$ denotes the set of in-neighbors of $u$, and $c$ is a decay factor typically set to 0.6 or 0.8 [10, 18].
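To make the recursion in Equation 1 concrete, the following is a minimal sketch that iterates the definition on a small graph, in the spirit of the Power Method of [10]. The adjacency representation (a dict mapping each node to its list of in-neighbors), the function name, the toy graph, and the convention that a pair with an empty in-neighbor set has similarity 0 are assumptions made for illustration.

```python
from itertools import product


def simrank_power_iteration(in_neighbors, c=0.6, iterations=10):
    """Iterate Equation 1 starting from s_0(u, v) = 1 if u == v else 0."""
    nodes = list(in_neighbors)
    sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, nodes)}
    for _ in range(iterations):
        new_sim = {}
        for u, v in product(nodes, nodes):
            if u == v:
                new_sim[(u, v)] = 1.0
                continue
            in_u, in_v = in_neighbors[u], in_neighbors[v]
            if not in_u or not in_v:
                # Assumed convention: no in-neighbors means similarity 0.
                new_sim[(u, v)] = 0.0
                continue
            total = sum(sim[(x, y)] for x in in_u for y in in_v)
            new_sim[(u, v)] = c * total / (len(in_u) * len(in_v))
        sim = new_sim
    return sim


# Hypothetical toy graph with edges a -> b and a -> c,
# so b and c share the single in-neighbor a.
toy = {"a": [], "b": ["a"], "c": ["a"]}
scores = simrank_power_iteration(toy, c=0.6)
```

Each iteration touches every pair of nodes, which illustrates why all-pair computation becomes prohibitive on large graphs.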

Computing SimRank efficiently is a non-trivial problem that has been studied extensively in the past decade. The early effort [10] focuses on computing the SimRank similarities of all pairs of nodes in $G$, but the proposed Power Method algorithm incurs prohibitive overheads when the number $n$ of nodes in $G$ is large, as there exist $O(n^2)$ node pairs in $G$. To avoid the inherent costs in all-pair SimRank computation, the majority of the subsequent work considers two types of SimRank queries instead:

• Single-source SimRank query: given a query node $u$, return $s(u,v)$ for every node $v$ in $G$;

• Top-$k$ SimRank query: given a query node $u$ and a parameter $k$, return the $k$ nodes $v$ with the largest $s(u,v)$.

Existing techniques [12, 19, 26, 6, 13, 23, 30, 15] for these two types of queries, however, suffer from two major deficiencies. First, most methods [19, 13, 15, 23, 30] fail to provide any worst-case guarantee in terms of the accuracy of query results, as they either rely on heuristics or adopt an incorrect formulation of SimRank. Second, the existing solutions [26, 6, 15] with theoretical accuracy guarantees all require constructing index structures on the input graphs in a preprocessing phase, which incurs significant space and pre-computation overheads. The only exception is the Monte Carlo method in [6], which provides an index-free solution with theoretical accuracy guarantees. While pioneering, this method unfortunately entails considerable query overheads, as shown in [23].

Motivations. In this paper, we aim to develop an index-free solution for single-source and top-$k$ SimRank queries with provable accuracy guarantees. Our motivation for devising algorithms without preprocessing is two-fold. First, index-based SimRank methods often have difficulties handling dynamic graphs. For example, SLING [26], which is the state-of-the-art index-based SimRank algorithm for static graphs, requires its index structure to be rebuilt from scratch whenever the input graph is updated, and its index construction requires several hours even on medium-size graphs with 1 million nodes. This renders it infeasible for real-time queries on dynamic graphs. In contrast, index-free techniques can naturally support real-time SimRank queries on graphs with frequent updates. To the best of our knowledge, the TSF method [23] is the only indexing approach that allows efficient updates. However, TSF is unable to provide any worst-case guarantees in terms of the accuracy of the SimRank estimations, which leads to unsatisfying empirical effectiveness, as shown in [33] and in our experiments.

Our second motivation is that index-based SimRank methods often fail to scale to large graphs due to their space overheads. For example, TSF requires an index space that is two to three orders of magnitude larger than the input graph size, and our empirical study shows that it runs out of 64GB memory for graphs over 1GB in our experiments. This renders it applicable only to small- to medium-size datasets. Further, moving the large index to external memory would incur expensive preprocessing and query costs, as shown in our empirical study. In contrast, the index-free solution proposed in this paper requires no space beyond the original graph for an index.

Our contributions. This paper presents an in-depth study on single-source and top-$k$ SimRank queries, and makes the following contributions. First, for single-source and top-$k$ SimRank queries, we propose an algorithm with provable approximation guarantees. In particular, given two constants $\varepsilon_a$ and $\delta$, our algorithm ensures that, with at least $1-\delta$ probability, each SimRank similarity returned has at most $\varepsilon_a$ absolute error. The algorithm runs in $O\left(\frac{n}{\varepsilon_a^2}\log\frac{n}{\delta}\right)$ expected time, and it does not require any index structure to be pre-computed on the input graph. Our algorithm matches the state-of-the-art index-free solution in terms of time complexity, but it offers much higher practical efficiency due to an improved algorithm design and several non-trivial optimizations.

Our second contribution is a large set of experiments that evaluate the proposed solutions against the state of the art on benchmark datasets. Most notably, we present the first empirical study that evaluates the effectiveness of SimRank algorithms on billion-edge graphs, using the idea of pooling borrowed from the information retrieval community. The results demonstrate that our solutions significantly outperform the existing methods in terms of both efficiency and effectiveness. In addition, our solutions are more scalable than the state-of-the-art index-based techniques, in that they can handle large graphs on which the existing solutions require excessive space and time costs in preprocessing.

## 2 Preliminaries

### 2.1 Problem Definition

Table 1 shows the notations that are frequently used in the remainder of the paper. Let $G=(V,E)$ be a directed simple graph with $|V|=n$ and $|E|=m$. We aim to answer approximate single-source and top-$k$ SimRank queries, defined as follows:

###### Definition 1 (Approximate Single-Source Queries)

Given a node $u$ in $G$, an absolute error threshold $\varepsilon_a$, and a failure probability $\delta$, an approximate single-source SimRank query returns an estimated value $\tilde{s}(u,v)$ for each node $v$ in $G$, such that

$$|\tilde{s}(u,v)-s(u,v)|\le\varepsilon_a$$

holds for any $v$ with at least $1-\delta$ probability.

###### Definition 2 (Approximate Top-k Queries)

Given a node $u$ in $G$, a positive integer $k$, an error threshold $\varepsilon_a$, and a failure probability $\delta$, an approximate top-$k$ SimRank query returns a sequence of $k$ nodes $v_1, v_2, \ldots, v_k$ and an estimated value $\tilde{s}(u,v_i)$ for each $v_i$, such that the following condition holds with at least $1-\delta$ probability for any $i \in [1,k]$:

$$s(u,v_i)\ge s(u,v'_i)-\varepsilon_a$$

where $v'_i$ is the node in $G$ whose SimRank similarity to $u$ is the $i$-th largest.

Essentially, the approximate top-$k$ query for node $u$ returns $k$ nodes such that their actual SimRank similarities with respect to $u$ are $\varepsilon_a$-close to those of the actual top-$k$ nodes. It is easy to see that an approximate single-source algorithm can be extended to answer approximate top-$k$ queries, by sorting the SimRank estimations and outputting the top-$k$ results. Therefore, our main focus is on designing efficient and scalable algorithms that answer approximate single-source queries with an $\varepsilon_a$ absolute-error guarantee.
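The reduction from top-$k$ to single-source queries can be sketched as follows, assuming the single-source estimates are available as a dict from node to estimated similarity. The function name, the example values, and the choice to exclude the query node itself from the ranking are assumptions for illustration.

```python
import heapq


def top_k_from_single_source(estimates, query, k):
    """Pick the k nodes with the largest estimated similarity to `query`."""
    # Assumption: the query node (similarity 1 to itself) is not ranked.
    candidates = [(v, s) for v, s in estimates.items() if v != query]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Hypothetical single-source estimates for a query node "u".
estimates = {"u": 1.0, "a": 0.12, "b": 0.40, "c": 0.05, "d": 0.31}
top2 = top_k_from_single_source(estimates, "u", 2)
```

Using `heapq.nlargest` avoids a full sort when $k$ is much smaller than the number of nodes.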

### 2.2 SimRank with Random Walks

In the seminal paper [10] that proposes SimRank, Jeh and Widom show that there is an interesting connection between SimRank similarities and random walks. In particular, let $u$ and $v$ be two nodes in $G$, and $W(u)$ (resp. $W(v)$) be a random walk from $u$ (resp. $v$) that follows a randomly selected incoming edge at each step. Let $t$ be the smallest positive integer such that the $t$-th nodes of $W(u)$ and $W(v)$ are identical. Then, we have

$$s(u,v)=\mathrm{E}\left[c^{t-1}\right], \tag{2}$$

where $c$ is the decay factor in the definition of SimRank (see Equation 1).

Subsequently, it is shown in [26] that Equation 2 can be simplified based on the concept of $\sqrt{c}$-walks, defined as follows.

###### Definition 3 (√c-walks)

Given a node $u$ in $G$, a $\sqrt{c}$-walk from $u$ is a random walk that follows the incoming edges of each node and stops at each step with $1-\sqrt{c}$ probability.

A $\sqrt{c}$-walk from $u$ can be generated as follows. Starting from $u$, when visiting a node $w$, we generate a random number $r$ in $[0,1]$ and check whether $r \le 1-\sqrt{c}$. If so, we terminate the walk at $w$; otherwise, we select one of the in-neighbors of $w$ uniformly at random and proceed to that node.
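The generation procedure for a $\sqrt{c}$-walk (a walk over incoming edges that stops at each step with probability $1-\sqrt{c}$) can be sketched as below. The dict-of-in-neighbor-lists representation, the function name, and the rule that the walk also stops at a node without in-neighbors are assumptions for illustration.

```python
import math
import random


def sqrt_c_walk(in_neighbors, start, c=0.6, rng=random):
    """Generate one sqrt(c)-walk from `start` as a list of visited nodes."""
    stop_probability = 1.0 - math.sqrt(c)
    walk = [start]
    current = start
    while True:
        # Stop at the current node with probability 1 - sqrt(c).
        if rng.random() <= stop_probability:
            break
        preds = in_neighbors[current]
        if not preds:
            # Assumption: a node without in-neighbors ends the walk.
            break
        current = rng.choice(preds)
        walk.append(current)
    return walk
```

Passing a seeded `random.Random` instance as `rng` makes the generated walks reproducible.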

Let $W'(u)$ and $W'(v)$ be two $\sqrt{c}$-walks from two nodes $u$ and $v$, respectively. We say that two $\sqrt{c}$-walks meet, if there exists a positive integer $i$ such that the $i$-th nodes of $W'(u)$ and $W'(v)$ are the same. Then, according to [26],

$$s(u,v)=\Pr[W'(u) \text{ and } W'(v) \text{ meet}]. \tag{3}$$

Based on Equation 3, one may estimate $s(u,v)$ using a Monte Carlo approach [6, 26] as follows. First, we generate $n_r$ pairs of $\sqrt{c}$-walks, such that the first and second walks in each pair are from $u$ and $v$, respectively. Let $n_{meet}$ be the number of $\sqrt{c}$-walk pairs that meet. Then, we use $n_{meet}/n_r$ as an estimation of $s(u,v)$. By the Chernoff bound (see the full version of the paper), it can be shown that when $n_r = \Omega\left(\frac{1}{\varepsilon_a^2}\log\frac{1}{\delta}\right)$, with at least $1-\delta$ probability we have $|n_{meet}/n_r - s(u,v)| \le \varepsilon_a$. In addition, the expected time required to generate $n_r$ pairs of $\sqrt{c}$-walks is $O(n_r)$, since each $\sqrt{c}$-walk has at most $\frac{1}{1-\sqrt{c}}$ nodes in expectation.

The above Monte Carlo approach can also be straightforwardly adopted to answer any approximate single-source SimRank query from a node $u$. In particular, we can generate $n_r$ $\sqrt{c}$-walks from each node, and then use them to estimate $s(u,v)$ for every node $v$ in $G$. This approach is simple and does not require any pre-computation, but it incurs considerable query overheads, since it requires generating a large number of $\sqrt{c}$-walks from each node.
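As an illustration of the pairwise Monte Carlo estimator based on Equation 3, the sketch below samples pairs of $\sqrt{c}$-walks and reports the fraction of pairs that meet. The graph representation, helper names, sample size, and toy graph are assumptions for illustration; the sample size is not derived from the Chernoff bound.

```python
import math
import random


def sqrt_c_walk(in_neighbors, start, c, rng):
    """One sqrt(c)-walk from `start` (stops early at nodes without in-neighbors)."""
    walk, current = [start], start
    while rng.random() > 1.0 - math.sqrt(c) and in_neighbors[current]:
        current = rng.choice(in_neighbors[current])
        walk.append(current)
    return walk


def walks_meet(walk_u, walk_v):
    """Two walks meet if they hold the same node at the same position."""
    return any(x == y for x, y in zip(walk_u, walk_v))


def monte_carlo_simrank(in_neighbors, u, v, num_pairs, c=0.6, seed=0):
    """Estimate s(u, v) as the fraction of sampled walk pairs that meet."""
    rng = random.Random(seed)
    meets = 0
    for _ in range(num_pairs):
        walk_u = sqrt_c_walk(in_neighbors, u, c, rng)
        walk_v = sqrt_c_walk(in_neighbors, v, c, rng)
        meets += walks_meet(walk_u, walk_v)
    return meets / num_pairs


# Hypothetical toy graph with edges a -> b and a -> c; by Equation 1, s(b, c) = c.
toy = {"a": [], "b": ["a"], "c": ["a"]}
estimate = monte_carlo_simrank(toy, "b", "c", num_pairs=20000, c=0.6)
```

Answering a single-source query this way repeats the sampling for every node $v$, which is the overhead the text refers to.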

### 2.3 Competitors

The TopSim based algorithms. To address the drawbacks of the Monte Carlo approach, Lee et al. [13] propose TopSim-SM, an index-free algorithm that answers top-$k$ SimRank queries by enumerating all short random walks from the query node. More precisely, given a query node $u$ and a number $T$, TopSim-SM enumerates all the vertices that reach $u$ by at most $T$ hops, and treats them as potential meeting points. Then, TopSim-SM enumerates, for each meeting point $w$, the vertices that are reachable from $w$ within $T$ hops. Lee et al. [13] also propose two variants of TopSim-SM, named Trun-TopSim-SM and Prio-TopSim-SM, which trade accuracy for efficiency. In particular, Trun-TopSim-SM omits the meeting points with large degrees, while Prio-TopSim-SM prioritizes the meeting points in a more sophisticated manner and explores only the high-priority ones.

For each node $v$ returned, TopSim-SM provides an estimated SimRank $\tilde{s}(u,v)$ that equals the SimRank value approximated using the Power Method [10] with $T$ iterations. When $T$ is sufficiently large, $\tilde{s}(u,v)$ can be an accurate approximation of $s(u,v)$. However, Lee et al. [13] show that the query complexity of TopSim-SM is $O(d^{2T})$ time, where $d$ is the average in-degree of the graph. As a consequence, Lee et al. [13] suggest setting $T=3$ to achieve reasonable efficiency, in which case the absolute error in each SimRank score can be as large as $c^T$, where $c$ is the decay factor in the definition of SimRank (Equation 1). Meanwhile, Trun-TopSim-SM and Prio-TopSim-SM do not provide any approximation guarantees even if $T$ is set to a large value, due to the heuristics that they apply to reduce the number of meeting points explored.

The TSF algorithm. Very recently, Shao et al. [23] propose a two-stage random-walk sampling framework (TSF) for top-$k$ SimRank queries on dynamic graphs. Given a parameter $R_g$, TSF starts by building $R_g$ one-way graphs as an index structure. Each one-way graph is constructed by uniformly sampling one in-neighbor from each vertex's incoming edges. The one-way graphs are then used to simulate random walks during query processing.
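The construction of one-way graphs described above can be sketched as follows. This is only an illustration of the sampling step, not the TSF implementation of [23]; the representation (a dict from node to in-neighbor list), the function names, and the use of `None` for vertices without in-neighbors are assumptions.

```python
import random


def sample_one_way_graph(in_neighbors, rng):
    """Keep one uniformly sampled in-neighbor per vertex (None if it has none)."""
    return {
        v: (rng.choice(preds) if preds else None)
        for v, preds in in_neighbors.items()
    }


def build_one_way_graphs(in_neighbors, num_graphs, seed=0):
    """Build `num_graphs` independent one-way graphs as a simple index."""
    rng = random.Random(seed)
    return [sample_one_way_graph(in_neighbors, rng) for _ in range(num_graphs)]


# Hypothetical toy graph given as in-neighbor lists.
toy = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"], "d": []}
index = build_one_way_graphs(toy, num_graphs=5)
```

Because every sampled graph stores one entry per vertex, the index grows linearly with the number of one-way graphs, which is consistent with the space overhead discussed in the introduction.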

To achieve high efficiency, TSF approximates the SimRank score of two nodes $u$ and $v$ as

$$\sum_i \Pr[\text{two } \sqrt{c}\text{-walks from } u \text{ and } v \text{ meet at the } i\text{-th step}],$$

which is an overestimate of the actual SimRank. (See Section 3.3 in [23].) Furthermore, TSF assumes that every random walk in a one-way graph would not contain any cycle, which does not always hold in practice, especially for undirected graphs. (See Section 3.2 in [23].) As a consequence, the SimRank value returned by TSF does not provide any theoretical error assurance.

## 3 ProbeSim Algorithm

In this section, we present ProbeSim, an index-free algorithm for approximate single-source and top-$k$ SimRank queries on large graphs. Recall that an approximate single-source algorithm can be extended to answer approximate top-$k$ queries, by sorting the SimRank estimations and outputting the top-$k$ results. Therefore, the ProbeSim algorithm described in this section focuses on approximate single-source queries with an $\varepsilon_a$ absolute-error guarantee. Before diving into the details of the algorithm, we first give some high-level ideas.

### 3.1 Rationale

Let $W(u)$ and $W(v)$ be two $\sqrt{c}$-walks from two nodes $u$ and $v$, respectively. Let $u_i$ be the $i$-th node in $W(u)$. (Note that $u_1=u$.) By Equation 3,

$$s(u,v)=\Pr[W(u) \text{ and } W(v) \text{ meet}]=\sum_i \Pr[W(u) \text{ and } W(v) \text{ first meet at } u_i]. \tag{4}$$

In other words, for a given $W(u)$, if we can estimate the probability that a $\sqrt{c}$-walk from $v$ first meets $W(u)$ at $u_i$, then we can take the sum of the estimated probabilities over all $u_i$ as an estimation of $s(u,v)$. Towards this end, a naive approach is to generate a large number of $\sqrt{c}$-walks from $v$, and then check the fraction of walks that first meet $W(u)$ at $u_i$. However, if $s(u,v)$ is small, then most of the $\sqrt{c}$-walks are "wasted" since they would not meet $W(u)$. To address this issue, our idea is as follows: instead of sampling $\sqrt{c}$-walks from each $v$ to see if they can "hit" any $u_i$, we start a graph traversal from each $u_i$ to identify any node $v$ that has a non-negligible probability to "walk" to $u_i$. Intuitively, this significantly reduces the computation cost since it may enable us to omit the nodes whose SimRank similarities to $u$ are smaller than a given threshold $\varepsilon_a$.

In what follows, we first explain the details of the traversal-based algorithm mentioned above, and then analyze its approximation guarantee and time complexity. For convenience, we formalize the concept of first-meeting probability as follows.

###### Definition 4 (first-meeting probability)

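Given a reverse path $\mathcal{P}=(u_1,\ldots,u_i)$ and a node $v \in V$, $v \neq u_1$, the first-meeting probability of $v$ with respect to $\mathcal{P}$ is defined to be

$$P(v,\mathcal{P})=\Pr_{W(v)}[v_i=u_i,\ v_{i-1}\neq u_{i-1},\ \ldots,\ v_1\neq u_1],$$

where $W(v)=(v_1,v_2,\ldots)$ is a random $\sqrt{c}$-walk that starts at $v_1=v$.

Here, the subscript $W(v)$ in $\Pr_{W(v)}$ indicates that the randomness arises from the choices of the $\sqrt{c}$-walk $W(v)$. In the remainder of the paper, we will omit this subscript when the context is clear.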
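To illustrate the definition, the first-meeting probability can be evaluated directly from the side of $v$ by tracking the distribution of the walk's position and discarding the mass that lands on a forbidden node. This forward computation is only a sketch for checking small examples and is not the traversal used by ProbeSim; the graph representation, function name, and toy graph are assumptions.

```python
import math
from collections import defaultdict


def first_meeting_probability(in_neighbors, v, reverse_path, c):
    """Pr[v_i = u_i, v_{i-1} != u_{i-1}, ..., v_1 != u_1] for a sqrt(c)-walk from v."""
    sqrt_c = math.sqrt(c)
    position = {v: 1.0}  # distribution of v_1
    for u_k in reverse_path[:-1]:
        position.pop(u_k, None)  # enforce v_k != u_k
        step = defaultdict(float)
        for x, prob in position.items():
            preds = in_neighbors[x]
            for y in preds:
                # Continue with probability sqrt(c), then pick a uniform in-neighbor.
                step[y] += prob * sqrt_c / len(preds)
        position = step
    return position.get(reverse_path[-1], 0.0)


# Hypothetical toy graph with edges a -> b and a -> c.
toy = {"a": [], "b": ["a"], "c": ["a"]}
p = first_meeting_probability(toy, "c", ["b", "a"], c=0.25)
```

With the reverse path $(b, a)$ and $c = 0.25$, a walk from node `c` reaches `a` at its second node exactly when it does not stop at the first step, so `p` equals $\sqrt{c}$.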

### 3.2 Basic algorithm

We now describe the basic ProbeSim algorithm. Given a node $u \in V$, a sampling error parameter $\varepsilon$ and a failure probability $\delta$, the algorithm returns $S=\{(v,\tilde{s}(u,v)) \mid v \in V\}$, a hash_set of the nodes in $V$ and their SimRank estimations. For every node $v \in V$, $v \neq u$, Algorithm 1 returns an estimate $\tilde{s}(u,v)$ of the actual SimRank $s(u,v)$ with the guarantee $|\tilde{s}(u,v)-s(u,v)| \le \varepsilon$, which holds for all nodes simultaneously with probability at least $1-\delta$. Note that the basic algorithm uses unbiased sampling to produce the estimators, thus we can set $\varepsilon=\varepsilon_a$.

The pseudo-code for the basic ProbeSim algorithm is illustrated in Algorithm 1. The algorithm runs $n_r$ independent trials (Line 1). For the $k$-th trial, the algorithm generates a $\sqrt{c}$-walk $W(u)=(u_1,\ldots,u_\ell)$ (Lines 2-3), and invokes the PROBE algorithm on the partial $\sqrt{c}$-walk $W(u,i)=(u_1,\ldots,u_i)$ for $i=2,\ldots,\ell$ (Lines 5-6). The PROBE algorithm computes a score $\mathrm{Score}(v)$ for each node $v \in V$. As we shall see later, $\mathrm{Score}(v)$ is equal to $P(v,W(u,i))$, the first-meeting probability of $v$ with respect to the partial walk $W(u,i)$. Let $\mathrm{Score}_i(v)$ denote the score computed by the PROBE algorithm on the partial $\sqrt{c}$-walk $W(u,i)$, for $i=2,\ldots,\ell$. The algorithm sums up all scores to form the estimator $\tilde{s}_k(u,v)=\sum_{i=2}^{\ell}\mathrm{Score}_i(v)$ (Lines 7-11).

Finally, for each node $v$, we take the average over the $n_r$ independent estimators to form the final estimator $\tilde{s}(u,v)=\frac{1}{n_r}\sum_{k=1}^{n_r}\tilde{s}_k(u,v)$. Note that if we took the average after all trials finish, it would require $O(n \cdot n_r)$ space to store all the values. Thus, we dynamically update the average estimator after each trial (Lines 12-16). After all trials finish, we return $S$ as the SimRank estimators for each node (Line 17).
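The dynamic averaging step can be sketched as an incremental mean, so that only one value per node is kept regardless of the number of trials. The function name, the dict-based representation of per-trial scores, and the example values are assumptions for illustration.

```python
def update_running_average(average, trial_scores, trial_index):
    """Fold the scores of trial `trial_index` (1-based) into `average` in place."""
    k = trial_index
    # Nodes missing from a trial contribute a score of 0 for that trial.
    for node in set(average) | set(trial_scores):
        previous = average.get(node, 0.0)
        average[node] = previous * (k - 1) / k + trial_scores.get(node, 0.0) / k
    return average


# Hypothetical per-trial estimators for three trials.
trials = [{"a": 0.4}, {"a": 0.2, "b": 0.6}, {"b": 0.3}]
average = {}
for k, scores in enumerate(trials, start=1):
    update_running_average(average, scores, k)
```

An equivalent alternative is to accumulate per-node sums and divide by the number of trials once at the end, which also needs only one value per node.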

Deterministic PROBE Algorithm. We now give a simple deterministic PROBE algorithm for computing the scores in Algorithm 1. Given a partial $\sqrt{c}$-walk $W(u,i)=(u_1,\ldots,u_i)$ that starts at $u_1=u$, the PROBE algorithm outputs $S=\{(v,\mathrm{Score}(v))\}$, a hash_set of nodes and their first-meeting probabilities with respect to the reverse path $W(u,i)$.

The pseudo-code for the algorithm is shown in Algorithm 2. The algorithm initializes $i$ hash tables $H_0,\ldots,H_{i-1}$ (Line 1) and adds $(u_i,1)$ to $H_0$ (Line 2). In the $j$-th iteration, for each node $x$ in $H_{j-1}$, the algorithm finds each out-neighbour $v$ of $x$ with $v \neq u_{i-j}$, and checks if $v \in H_j$ (Lines 3-5). If so, the algorithm adds $\frac{\sqrt{c}}{|I(v)|}\cdot\mathrm{Score}(x)$ to $\mathrm{Score}(v)$ (Lines 6-7). Otherwise, it adds $\left(v,\frac{\sqrt{c}}{|I(v)|}\cdot\mathrm{Score}(x)\right)$ to $H_j$ (Lines 8-9). We note that in this iteration, no score is added to $u_{i-j}$, which ensures that the walk avoids $u_{i-j}$ at the corresponding step (Line 5).

The intuition of the PROBE algorithm is as follows. For ease of presentation, we let $\mathrm{Score}(v,j)$ denote the score computed in the $j$-th iteration for node $v$. One can show that $\mathrm{Score}(v,j)$ is in fact equal to $P(v,(u_{i-j},\ldots,u_i))$, the first-meeting probability of each node $v$ with respect to the reverse path $(u_{i-j},\ldots,u_i)$. Consequently, after the $(i-1)$-th iteration, the algorithm computes $P(v,(u_1,\ldots,u_i))$, the first-meeting probability of each node $v$ with respect to the reverse path $(u_1,\ldots,u_i)$. We will make this argument rigorous in the analysis.
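The following is a minimal sketch of the deterministic PROBE traversal together with the basic trial loop, written from the description above rather than from the paper's pseudo-code. The graph representation, all function names, the level-by-level dicts standing in for the hash tables, and the toy graph are assumptions; the final average is taken from accumulated sums, which is equivalent to the dynamic update.

```python
import math
import random
from collections import defaultdict


def out_neighbors_of(in_neighbors):
    """Invert in-neighbor lists into out-neighbor lists."""
    out = defaultdict(list)
    for v, preds in in_neighbors.items():
        for x in preds:
            out[x].append(v)
    return out


def sqrt_c_walk(in_neighbors, start, c, rng):
    """One sqrt(c)-walk from `start` (stops early at nodes without in-neighbors)."""
    walk, current = [start], start
    while rng.random() > 1.0 - math.sqrt(c) and in_neighbors[current]:
        current = rng.choice(in_neighbors[current])
        walk.append(current)
    return walk


def probe(in_neighbors, out_neighbors, partial_walk, c):
    """First-meeting probability of every reached node w.r.t. (u_1, ..., u_i)."""
    sqrt_c = math.sqrt(c)
    i = len(partial_walk)
    level = {partial_walk[-1]: 1.0}  # start from (u_i, 1)
    for j in range(1, i):
        avoid = partial_walk[i - j - 1]  # u_{i-j} in 1-based indexing
        next_level = defaultdict(float)
        for x, score_x in level.items():
            for v in out_neighbors.get(x, ()):
                if v == avoid:
                    continue  # the walk from v must not be at u_{i-j} here
                next_level[v] += sqrt_c / len(in_neighbors[v]) * score_x
        level = next_level
    return dict(level)


def probesim_basic(in_neighbors, u, num_trials, c=0.6, seed=0):
    """Average, over trials, of the summed probe scores along one walk from u."""
    rng = random.Random(seed)
    out_neighbors = out_neighbors_of(in_neighbors)
    totals = defaultdict(float)
    for _ in range(num_trials):
        walk = sqrt_c_walk(in_neighbors, u, c, rng)
        for i in range(2, len(walk) + 1):
            scores = probe(in_neighbors, out_neighbors, walk[:i], c)
            for v, score in scores.items():
                totals[v] += score
    return {v: total / num_trials for v, total in totals.items()}


# Hypothetical toy graph with edges a -> b and a -> c; by Equation 1, s(b, c) = c.
toy = {"a": [], "b": ["a"], "c": ["a"]}
estimates = probesim_basic(toy, "b", num_trials=20000, c=0.25)
```

On this toy graph each trial either contributes $\sqrt{c}$ to node `c` (when the walk from `b` reaches `a`) or nothing, so the average approaches $c$.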

Running Example for Algorithm 1 and 2. Throughout the paper, we will use a toy graph in Figure 2 to illustrate our algorithms and pruning rules. For ease of presentation, we set the decay factor so that . The SimRank values of each node to is listed in Table 2, which are computed by the Power Method within error.

Suppose at the -th trial, a random -walk is generated. Figure 2 illustrates the traverse process of the deterministic PROBE algorithm. For simplicity, we only demonstrate the traverse process for partial walk , which is represented by the right-most tree in Figure 2. The algorithm first inserts to . Following the out-edges of , the algorithm finds and omits it as . Next, the algorithm finds , computes , and insert to . Similarly, the algorithm finds and , and inserts and to . For the next iteration, we find , , and from the out-neighbours of , and . Note that is omitted due to the fact that . The score of at this iteration is computed by

Similarly, the algorithm computes , and , and insert , , and to . Finally, for the last iteration, the algorithm computes and , and returns as the results.

To get an estimation from -walk , Algorithm 1 will invoke PROBE for , and . Each probe gives score set: , and . As an example, the estimator is computed by summing up all scores for , which equals to . By summing all scores up from different nodes in , the returned estimation of SimRank scores are and .

### 3.3 Analysis

Time Complexity. We notice that in each iteration of the PROBE algorithm, each edge in the graph is traversed at most once. Thus the time complexity of the PROBE algorithm is $O(m \cdot i)$, where $i$ is the length of the partial $\sqrt{c}$-walk $W(u,i)$. Consequently, the time complexity of probing a single $\sqrt{c}$-walk in Algorithm 1 is bounded by $\sum_{i=2}^{\ell} O(m \cdot i)=O(m \cdot \ell^2)$, where $\ell$ is the length of the $\sqrt{c}$-walk $W(u)$. We notice that each step in the $\sqrt{c}$-walk terminates with probability at least $1-\sqrt{c}$, so $\ell$ is bounded by a geometrically distributed random variable $X$ with success probability $p=1-\sqrt{c}$ (here "success" means the termination of the $\sqrt{c}$-walk). It follows that

$$\mathrm{E}[\ell^2]\le \mathrm{E}[X^2]=\mathrm{Var}(X)+\mathrm{E}[X]^2=\frac{1-p}{p^2}+\frac{1}{p^2}=\frac{2-p}{p^2}=\frac{1+\sqrt{c}}{(1-\sqrt{c})^2}=O(1).$$

Therefore, the expected running time of Algorithm 1 on a single $\sqrt{c}$-walk is $O(m)$. Summing over the $n_r$ walks, the expected running time of Algorithm 1 is bounded by $O(n_r \cdot m)$.
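The second-moment identity for a geometric random variable used above can be checked numerically. The sketch below compares a truncated series for $\mathrm{E}[X^2]$ with the closed form; the function name, the choice $c=0.6$, and the truncation length are assumptions for illustration.

```python
import math


def geometric_second_moment(p, terms=10_000):
    """Truncated series for E[X^2], where Pr[X = k] = p * (1 - p)^(k - 1)."""
    return sum(k * k * p * (1.0 - p) ** (k - 1) for k in range(1, terms + 1))


c = 0.6
p = 1.0 - math.sqrt(c)  # per-step termination probability
series_value = geometric_second_moment(p)
closed_form = (1.0 + math.sqrt(c)) / (1.0 - math.sqrt(c)) ** 2
```

The two values agree up to floating-point error, and the closed form depends only on the decay factor, which is why the bound is a constant.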

Correctness. We now show that Algorithm 1 indeed gives a good estimation of the SimRank values for each $v \in V$, $v \neq u$. The following Lemma states that each trial in Algorithm 1 gives an unbiased estimator for the SimRank value $s(u,v)$.

###### Lemma 1

For any $v \in V$ and $v \neq u$, Algorithm 1 gives an estimator $\tilde{s}(u,v)$ such that $\mathrm{E}[\tilde{s}(u,v)]=s(u,v)$.

We need the following Lemma, which states that if we start a $\sqrt{c}$-walk $W(v)$ from $v$, then the score $\mathrm{Score}_i(v)$ computed by the PROBE algorithm on the partial walk $W(u,i)$ is exactly the probability that $W(v)$ and $W(u)$ first meet at $u_i$.

###### Lemma 2

For any node $v \in V$, $v \neq u$, after the $(i-1)$-th iteration, $\mathrm{Score}(v,i-1)$ is equal to $P(v,W(u,i))$, the first-meeting probability of $v$ with respect to the partial $\sqrt{c}$-walk $W(u,i)$.

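Proof. We prove the following claim: Let $\mathrm{Score}(v,j)$ denote the score of $v$ after the $j$-th iteration. Fix a node $v \in V$, $v \neq u_{i-j}$. After the $j$-th iteration in Algorithm 2, we have $\mathrm{Score}(v,j)=P(v,(u_{i-j},\ldots,u_i))$, the first-meeting probability of $v$ with respect to the reverse path $(u_{i-j},\ldots,u_i)$. Recall that

$$P(v,(u_{i-j},\ldots,u_i))=\Pr_{W(v)}[v_{j+1}=u_i,\ v_j\neq u_{i-1},\ \ldots,\ v_1\neq u_{i-j}],$$

where $W(v)=(v_1,v_2,\ldots)$ is a random $\sqrt{c}$-walk that starts at $v$.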

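Note that if the above claim is true, then after the $(i-1)$-th iteration, we have $\mathrm{Score}(v,i-1)=P(v,(u_1,\ldots,u_i))=P(v,W(u,i))$, and the Lemma will follow. We prove the claim by induction. After the $0$-th iteration, we have $\mathrm{Score}(u_i,0)=1$ and $\mathrm{Score}(v,0)=0$ for $v \neq u_i$, so the claim holds. Assume the claim holds for the $(j-1)$-th iteration. After the $j$-th iteration, for each $v \in V$, $v \neq u_{i-j}$, the algorithm sets $\mathrm{Score}(v,j)$ by the equation

$$\mathrm{Score}(v,j)=\sum_{x\in I(v)}\frac{\sqrt{c}}{|I(v)|}\cdot \mathrm{Score}(x,j-1). \tag{5}$$

By the induction hypothesis we have $\mathrm{Score}(x,j-1)=P(x,(u_{i-j+1},\ldots,u_i))$, and thus

$$\begin{aligned}\mathrm{Score}(v,j)&=\sum_{x\in I(v),\,x\neq u_{i-j+1}}\frac{\sqrt{c}}{|I(v)|}\cdot P(x,(u_{i-j+1},\ldots,u_i))\\&=\sum_{x\in I(v),\,x\neq u_{i-j+1}}\Pr[v_2=x]\cdot P(x,(u_{i-j+1},\ldots,u_i)).\end{aligned} \tag{6}$$

Here $\Pr[v_2=x]$ denotes the probability that $W(v)$ selects $x$ at the first step. On the other hand, $P(v,(u_{i-j},\ldots,u_i))$ can be expressed as the summation of the probabilities that $W(v)$ first selects a node $x$ that is not $u_{i-j+1}$, and then follows a $\sqrt{c}$-walk from $x$ that reaches $u_i$ at its $j$-th node while avoiding the corresponding node of the reverse path at every earlier step. It follows that

$$\begin{aligned}P(v,(u_{i-j},\ldots,u_i))&=\sum_{x\in I(v),\,x\neq u_{i-j+1}}\Pr[v_2=x]\cdot\Pr[v_{j+1}=u_i,\ \ldots,\ v_2\neq u_{i-j+1}\mid v_2=x]\\&=\sum_{x\in I(v),\,x\neq u_{i-j+1}}\Pr[v_2=x]\cdot P(x,(u_{i-j+1},\ldots,u_i)).\end{aligned} \tag{7}$$

Combining Equations (6) and (7) proves the claim, and the Lemma follows.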

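With the help of Lemma 2, we can prove Lemma 1.

Proof of Lemma 1. Let $\mathcal{W}(u)$ denote the set of all possible $\sqrt{c}$-walks that start at $u$. Fix a $\sqrt{c}$-walk $W(u)=(u_1,\ldots,u_\ell)$. By Lemma 2, the estimated SimRank can be expressed as $\tilde{s}(u,v)=\sum_{i=2}^{\ell}P(v,W(u,i))$. Thus we can compute the expectation of this estimation by

$$\begin{aligned}\mathrm{E}[\tilde{s}(u,v)]&=\sum_{W(u)\in\mathcal{W}(u)}\Pr[W(u)]\cdot\sum_{i=2}^{\ell}P(v,W(u,i))\\&=\sum_{W(u)\in\mathcal{W}(u)}\sum_{i=2}^{\ell}\Pr[W(u)]\cdot P(v,W(u,i)),\end{aligned} \tag{8}$$

where $\Pr[W(u)]$ is the probability of walk $W(u)$. Recall that $P(v,W(u,i))$ is the probability that a random $\sqrt{c}$-walk $W(v)$ first meets $W(u)$ at $u_i$. For ease of presentation, let $I(W(u),W(v),i)$ denote the indicator variable that two $\sqrt{c}$-walks $W(u)$ and $W(v)$ first meet at $u_i$. In other words, $I(W(u),W(v),i)=1$ if $W(u)$ and $W(v)$ first meet at $u_i$, and $I(W(u),W(v),i)=0$ otherwise. We have

$$P(v,W(u,i))=\sum_{W(v)\in\mathcal{W}(v)}\Pr[W(v)]\cdot I(W(u),W(v),i). \tag{9}$$

Combining Equations (8) and (9), it follows that

$$\begin{aligned}\mathrm{E}[\tilde{s}(u,v)]&=\sum_{W(u)\in\mathcal{W}(u)}\sum_{W(v)\in\mathcal{W}(v)}\sum_{i=2}^{\ell}\Pr[W(u)]\cdot\Pr[W(v)]\cdot I(W(u),W(v),i)\\&=\sum_{i\ge 2}\Pr[W(u) \text{ and } W(v) \text{ first meet at step } i]\\&=\Pr[W(u) \text{ and } W(v) \text{ meet}].\end{aligned}$$

Note that, by Equation 3, $s(u,v)$ is the probability that $W(u)$ and $W(v)$ meet, and the Lemma follows.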

By Lemma 1 and the Chernoff bound, we have the following Theorem, which states that by performing $n_r$ independent trials, the error of the estimator provided by Algorithm 1 can be bounded with high probability.

###### Theorem 1

For every node $v \in V$, $v \neq u$, Algorithm 1 returns an estimation $\tilde{s}(u,v)$ for $s(u,v)$ such that

$$\Pr[\forall v\in V,\ |\tilde{s}(u,v)-s(u,v)|\le\varepsilon]\ge 1-\delta.$$

We need the following form of the Chernoff bound:

###### Lemma 3 (Chernoff Bound [4])

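For any set $\{x_i\}$ ($i \in [1,n_x]$) of i.i.d. random variables with mean $\mu$ and $x_i \in [0,1]$,

$$\Pr\left\{\left|\sum_{i=1}^{n_x}x_i-n_x\mu\right|\ge n_x\varepsilon\right\}\le\exp\left(-\frac{n_x\cdot\varepsilon^2}{\frac{2}{3}\varepsilon+2\mu}\right).$$

Proof of Theorem 1. We first note that in each trial $k$, the estimator $\tilde{s}_k(u,v)$ is a value in $[0,1]$. It is obvious that $\tilde{s}_k(u,v)\ge 0$. To see that $\tilde{s}_k(u,v)\le 1$, notice that $\tilde{s}_k(u,v)$ is a probability. More precisely, it is the probability that a $\sqrt{c}$-walk $W(v)$ meets the $\sqrt{c}$-walk $W(u)$ using the same number of steps.

Thus, the final estimator $\tilde{s}(u,v)$ is the average of $n_r$ i.i.d. random variables whose values lie in the range $[0,1]$, and we can apply the Chernoff bound:

$$\Pr[|\tilde{s}(u,v)-s(u,v)|\ge\varepsilon]\le\exp\left(-\frac{\varepsilon^2 n_r}{3\,s(u,v)}\right).$$

Recall that $n_r=\frac{3c}{\varepsilon^2}\log\frac{n}{\delta}$, and notice that $s(u,v)\le c$ for $v \neq u$. It follows that

$$\Pr[|\tilde{s}(u,v)-s(u,v)|\ge\varepsilon]\le\exp\left(-\log\frac{n}{\delta}\right)=\frac{\delta}{n}.$$

Taking a union bound over all nodes, it follows that

$$\Pr[\exists v\in V,\ |\tilde{s}(u,v)-s(u,v)|\ge\varepsilon]\le\delta,$$

and the Theorem follows.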
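The arithmetic in the last two steps of the proof can be sketched numerically. The snippet assumes the number of trials is $n_r=\frac{3c}{\varepsilon^2}\log\frac{n}{\delta}$ with a natural logarithm, which is the reading under which the bound $\exp(-\log\frac{n}{\delta})=\frac{\delta}{n}$ follows; the function names and the example parameters are assumptions for illustration.

```python
import math


def num_trials(epsilon, delta, n, c=0.6):
    """Assumed trial count n_r = (3c / eps^2) * ln(n / delta), rounded up."""
    return math.ceil(3.0 * c / epsilon ** 2 * math.log(n / delta))


def per_node_failure_bound(epsilon, n_r, s_uv):
    """The per-node bound exp(-eps^2 * n_r / (3 * s(u, v))) used in the proof."""
    return math.exp(-epsilon ** 2 * n_r / (3.0 * s_uv))


# Hypothetical setting: one million nodes, absolute error 0.1, failure probability 0.01.
n, epsilon, delta, c = 10 ** 6, 0.1, 0.01, 0.6
n_r = num_trials(epsilon, delta, n, c)
# The bound is largest when s(u, v) takes its maximum value c.
worst_case = per_node_failure_bound(epsilon, n_r, c)
```

Since the trial count is rounded up, `worst_case` is at most `delta / n`, and multiplying by the number of nodes gives the overall failure probability `delta` from the union bound.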

## 4 Optimizations

We present three different optimization techniques to speed up our basic ProbeSim algorithm. The pruning rules eliminate unnecessary traversals in the PROBE algorithm, so that a single trial can be performed more efficiently. The batch algorithm builds a reachability tree to maintain all $\sqrt{c}$-walks, such that we do not have to perform duplicated PROBE operations in multiple trials. The randomized PROBE algorithm reduces the worst-case time complexity of our algorithm to $O\left(\frac{n}{\varepsilon^2}\log\frac{n}{\delta}\right)$ in expectation.

### 4.1 Pruning

Although the expected number of steps of a $\sqrt{c}$-walk is at most $\frac{1}{1-\sqrt{c}}$, we may still find some long walks during a large number of trials. To avoid this overhead, we add the following pruning rule:

###### Pruning Rule 1

Let be the termination parameter to be determined later. In Algorithm 1, truncate all -walks at step .

We explain the intuition of this pruning rule as follows. Let denote the walk, and denote a node on the walk with . For each node , , the probability that a walk meets at is at most , which means that will contribute at most to the SimRank . Summing up over results in an error of . As we shall see in Theorem 2, more elaborated analysis would show that the error contributed by this pruning rule is in fact bounded by . We further notice that it is one-sided error, we can add to each estimator, which will reduce the pruning error by a factor of 2.

The next pruning rule is inspired by the fact that the PROBE algorithm may traverse many vertices with small scores, which can be ignored for the estimation:

###### Pruning Rule 2

Let denote the pruning parameter to be determined later. In Algorithm 2, after computing all in and before descending to , we remove from if .

The intuition of pruning rule 2 is that after more iterations in Algorithm 2, the scores computed from will drop down to . One might think that a node may get multiple error contributions from different pruned nodes; However, the key insight is that the probabilities of the walks from to these nodes sum up to at most , which implies that the error introduced by a single probe is at most . We will make this argument rigorous in Theorem 2.

Running Example for the Pruning Rules. Consider