Realtime Index-Free Single Source SimRank Processing on Web-Scale Graphs

Abstract

Given a graph $G$ and a node $u \in G$, a single source SimRank query evaluates the similarity between $u$ and every node $v \in G$. Existing approaches to single source SimRank computation incur either long query response time, or expensive pre-computation, which needs to be performed again whenever the graph changes. Consequently, to our knowledge none of them is ideal for scenarios in which (i) query processing must be done in realtime, and (ii) the underlying graph is massive, with frequent updates.

Motivated by this, we propose SimPush, a novel algorithm that answers single source SimRank queries without any pre-computation, and at the same time achieves significantly higher query processing speed than even the fastest known index-based solutions. Further, SimPush provides rigorous result quality guarantees, and its high performance does not rely on any strong assumption about the underlying graph. Specifically, compared to existing methods, SimPush employs a radically different algorithmic design that focuses on (i) identifying a small number of nodes relevant to the query, and subsequently (ii) computing statistics and performing residue push from these nodes only.

We prove the correctness of SimPush, analyze its time complexity, and compare its asymptotic performance with that of existing methods. Meanwhile, we evaluate the practical performance of SimPush through extensive experiments on 9 real datasets. The results demonstrate that SimPush consistently outperforms all existing solutions, often by over an order of magnitude. In particular, on a commodity machine, SimPush answers a single source SimRank query on a web graph containing over 133 million nodes and 5.4 billion edges in under 62 milliseconds, with 0.00035 empirical error, while the fastest index-based competitor needs 1.18 seconds.

1 Introduction

Algorithm Query Time Index Size Preprocessing Time
SimPush - -
TSF [28]
READS [12]
ProbeSim [21] - -
SLING [31]
PRSim [33]
Table 1: Comparison of single-source SimRank algorithms with error tolerance $\epsilon$ and failure probability $\delta$.

SimRank is a popular similarity measure between nodes in a graph, with numerous potential applications, e.g., in recommendation systems [26], schema matching [25], spam detection [2], and graph mining [13, 19, 29]. The main idea of SimRank is that two nodes that are referenced by many similar nodes are themselves similar to each other. For instance, in a social network, two key opinion leaders who are followed by similar fans are expected to be similar in some way, e.g., sharing similar political positions or life experiences. Formally, given a graph $G$ and nodes $u, v \in G$, the SimRank value $s(u,v)$ between $u$ and $v$ is defined as follows:

$$s(u,v) = \begin{cases} 1, & \text{if } u = v, \\ \dfrac{c}{|I(u)| \cdot |I(v)|} \sum_{u' \in I(u)} \sum_{v' \in I(v)} s(u',v'), & \text{otherwise,} \end{cases}$$

where $I(u)$ and $I(v)$ are the sets of in-neighbors of $u$ and $v$, respectively, and $c \in (0,1)$ is a decay factor commonly fixed to a constant [31, 33, 21].
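To make the recursive definition concrete, the following is a minimal sketch of the naive fixed-point iteration over all node pairs. It is not one of the methods compared in this paper; the function name `simrank`, the adjacency format `in_nbrs` (node to list of in-neighbors), the value `c = 0.6`, and the iteration count are all illustrative assumptions. Its quadratic cost in the number of nodes is exactly why this direct approach does not scale.

```python
def simrank(in_nbrs, c=0.6, iters=10):
    """Naive SimRank by repeated application of the recursive definition.

    in_nbrs: dict mapping every node to the list of its in-neighbors
             (hypothetical input format). c and iters are illustrative.
    """
    nodes = list(in_nbrs)
    # Start from s(a, a) = 1 and s(a, b) = 0 for a != b.
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        nxt = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    nxt[(a, b)] = 1.0
                elif in_nbrs[a] and in_nbrs[b]:
                    total = sum(s[(x, y)] for x in in_nbrs[a] for y in in_nbrs[b])
                    nxt[(a, b)] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
                else:
                    # A node without in-neighbors is similar only to itself.
                    nxt[(a, b)] = 0.0
        s = nxt
    return s


# Nodes "b" and "c" are both referenced only by "a".
graph = {"a": [], "b": ["a"], "c": ["a"]}
scores = simrank(graph)
```

In this toy graph, `b` and `c` share their single in-neighbor, so their score equals the decay factor, while `a` has no in-neighbors and is similar only to itself.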

This paper focuses on single-source SimRank processing, which takes as input a node $u$, and computes the SimRank between $u$ and every node $v \in G$. This can be applied, for example, in a search engine that retrieves web pages similar to a given one, or in a social networking site that recommends new connections to a user. We focus on online scenarios, in which (i) query execution needs to be done in realtime, and (ii) the underlying graph can change frequently and unpredictably, meaning that query processing must not rely on heavy pre-computations whose results are expensive to update. For large graphs, this problem is highly challenging, since computing SimRank values is immensely expensive: its original definition, presented above, is recursive and requires numerous iterations over the entire graph to converge, which is clearly unscalable.

Several recent approaches, notably [15, 31, 21, 33, 28, 12], have demonstrated promising results for single source SimRank processing, by solving the approximate version of the problem with rigorous result quality guarantees, as elaborated in Section 2. The majority of these methods, however, require extensive pre-processing to index the input graph $G$; as explained in Section 2.2, such indexes cannot be easily updated when the underlying graph changes, meaning that these methods are not suitable for our target scenarios described above. Specifically, the current state of the art for offline single source SimRank is PRSim [33], which achieves efficient query processing with a relatively lightweight index; nevertheless, it is clearly infeasible to rebuild the index for every graph update or new query, as shown in our experiments in Section 5. The current best index-free solution is ProbeSim [21], whose query efficiency is far lower than that of PRSim. Consequently, ProbeSim yields poor response time for large graphs, adversely affecting user experience.

This paper proposes SimPush, a novel index-free solution for approximate single source SimRank processing that achieves significantly higher performance compared to all existing solutions (including index-based ones with heavy pre-computation), while providing rigorous quality guarantees. This is achieved through a novel algorithmic design that (i) identifies a small subset of nodes in $G$ that are most relevant to the query, called attention nodes, and subsequently (ii) computes important statistics and performs graph traversal starting from attention nodes only. In particular, to ensure $\epsilon$-approximate result quality (defined in Section 2.1), it suffices to identify $O(1/\epsilon)$ attention nodes. Existing solutions need to perform similar computations on a far larger set of nodes, covering the entire graph in the worst case.

Table 1 compares the asymptotic performance of SimPush against several recent approaches, where $n$ and $m$ denote the number of nodes and edges in $G$, respectively, and $\epsilon$ and $\delta$ are parameters for the error guarantee. For sparse graphs, is comparable to ; hence, compared to ProbeSim, the complexity of SimPush is lower for common values of $\epsilon$ and $\delta$. Further, SimPush does not involve large hidden constant factors (e.g., as in SLING), and makes no assumption on the data distribution of the underlying graph (e.g., as in PRSim, which assumes that $G$ is a power-law graph), as elaborated in Section 2.2.

We experimentally evaluate our method against 6 recent solutions using 9 real graphs. The results demonstrate the high practical performance of SimPush. In particular, SimPush outperforms all existing methods (both indexed and index-free) in terms of query processing time, and is usually over an order of magnitude faster than the previous best index-free method ProbeSim, at comparable result accuracy levels. Further, on the UK graph with over 133 million nodes and 5.4 billion edges, SimPush obtains 0.00035 empirical error within 62 milliseconds.

2 Preliminaries

2.1 Problem Definition

Let $G = (V, E)$ be a directed graph, where $V$ is the set of nodes with cardinality $n = |V|$, and $E$ is the set of edges with cardinality $m = |E|$. If the input graph is undirected, we simply convert each undirected edge $(u,v)$ to a pair of directed ones $(u,v)$ and $(v,u)$ with opposing directions. Following common practice in previous work [33, 21], we define the approximate single-source SimRank query as follows. Table 2 lists frequently used notations in the paper.

Definition (Approximate Single Source SimRank Query). Given an input graph $G$, a query node $u$, an absolute error threshold $\epsilon$, a failure probability $\delta$, and decay factor $c$, an approximate single source SimRank query returns an estimated value $\tilde{s}(u,v)$ for the exact SimRank $s(u,v)$ of each node $v \in V$, such that

$$|\tilde{s}(u,v) - s(u,v)| \le \epsilon \tag{1}$$

holds for any $v$ with at least $1-\delta$ probability.

2.2 State of the Art

SLING [31]. SimRank is well known to be linked to random walks [10]. Earlier work on SimRank processing generally uses random walks without decay. More recent approaches are mostly based on a variant called $\sqrt{c}$-walks, as follows.

Definition ($\sqrt{c}$-Walk [31]). Given node $u$ and decay factor $c$, a $\sqrt{c}$-walk from $u$ is a random walk that (i) has probability $1-\sqrt{c}$ to stop at the current node, and (ii) has probability $\sqrt{c}$ to jump to a random in-neighbor of the current node.

Given two $\sqrt{c}$-walks from distinct nodes $u$ and $v$ respectively, we say that these two $\sqrt{c}$-walks meet if they both reach the same node $w$ after the same number of steps, say, the $\ell$-th step. Let $\kappa^{(\ell)}(u,v,w)$ be the probability that two $\sqrt{c}$-walks from $u$ and $v$ meet at $w$ at the $\ell$-th step, and never meet again afterwards. Ref. [31] interprets the SimRank value $s(u,v)$ as follows:

$$s(u,v) = \sum_{\ell=1}^{\infty} \sum_{w \in V} \kappa^{(\ell)}(u,v,w) \tag{2}$$

SLING [31] further decomposes $\kappa^{(\ell)}(u,v,w)$ into the product of three probabilities:

$$\kappa^{(\ell)}(u,v,w) = h^{(\ell)}(u,w) \cdot \eta(w) \cdot h^{(\ell)}(v,w) \tag{3}$$

where $h^{(\ell)}(u,w)$ denotes the probability (called hitting probability) that a $\sqrt{c}$-walk from node $u$ reaches node $w$ at the $\ell$-th step. Since the random walks starting from nodes $u$ and $v$ are independent, the product $h^{(\ell)}(u,w) \cdot h^{(\ell)}(v,w)$ gives the probability (called meeting probability) that these two walks meet at node $w$ (called the meeting node). The correction factor $\eta(w)$ (called the last-meeting probability of node $w$) is the probability that the above two $\sqrt{c}$-walks, after meeting at $w$, never meet again in the future. Clearly, this is equivalent to the probability that two independent $\sqrt{c}$-walks starting from $w$ never meet at any step.
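The random-walk interpretation above can be illustrated with a small Monte Carlo sketch: since every pair of walks that meets has exactly one last meeting, summing the last-meeting events over all steps and nodes gives the probability that the two walks meet at all, so the fraction of sampled walk pairs that meet estimates the SimRank value. This is only an illustration of the interpretation, not any of the algorithms discussed in this paper; the names `walks_meet` and `mc_simrank`, the adjacency format, the value `c = 0.6`, and the sample count are assumptions made for the example.

```python
import math
import random


def walks_meet(in_nbrs, u, v, c, rng):
    """Simulate two sqrt(c)-walks step by step; return True if they meet."""
    p = math.sqrt(c)
    a, b = u, v
    while True:
        # Each walk independently continues with probability sqrt(c);
        # a walk at a node without in-neighbors cannot continue.
        if not in_nbrs[a] or rng.random() >= p:
            return False
        if not in_nbrs[b] or rng.random() >= p:
            return False
        a = rng.choice(in_nbrs[a])
        b = rng.choice(in_nbrs[b])
        if a == b:  # same node after the same number of steps
            return True


def mc_simrank(in_nbrs, u, v, c=0.6, samples=20000, seed=0):
    """Estimate s(u, v) for u != v as the fraction of walk pairs that meet."""
    rng = random.Random(seed)
    hits = sum(walks_meet(in_nbrs, u, v, c, rng) for _ in range(samples))
    return hits / samples


graph = {"a": [], "b": ["a"], "c": ["a"]}
estimate = mc_simrank(graph, "b", "c")
```

In this toy graph the two walks meet exactly when both take their first step, which happens with probability $\sqrt{c} \cdot \sqrt{c} = c$, so the estimate should be close to the decay factor.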

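Notation  Description
$G = (V, E)$  Input graph $G$ with nodes $V$ and edges $E$
$O(v)$, $I(v)$  Out-neighbors and in-neighbors of node $v$
$d_O(v)$, $d_I(v)$  Out-degree and in-degree of node $v$
$c$  Decay factor in SimRank
$\epsilon$, $\delta$  Maximum absolute error and failure probability in approximate SimRank
$\epsilon_h$  Error parameter decided by $\epsilon$ and $c$
$G_u$  Source graph generated for query node $u$
$A_u$  Set of all attention nodes with respect to $u$
$A_u^{(\ell)}$  Set of attention nodes at the $\ell$-th level of $G_u$, where $1 \le \ell \le L$
$L$  Max level in $G_u$
$w$, $w_i$, $w_j$  Attention nodes at the $\ell$-th level, $(\ell+i)$-th level, and $(\ell+j)$-th level of $G_u$ respectively, where $i \ge 1$ and $j \ge 1$
$h^{(\ell)}(u,w)$  $\ell$-step hitting probability from $u$ to $w$ in $G$
$\bar{h}^{(i)}(w,w_i)$  $i$-step hitting probability from $w$ to $w_i$ in $G_u$
$\hat{h}^{(\ell)}(v,w)$  Approximate hitting probability from $v$ to $w$ in $G$
$\gamma^{(\ell)}(w)$  Last-meeting probability of attention node $w$ at the $\ell$-th level of $G_u$
$r^{(\ell)}(w)$  Residue of attention node $w$, $r^{(\ell)}(w) = h^{(\ell)}(u,w) \cdot \gamma^{(\ell)}(w)$
$\kappa^{(\ell)}(u,v,w)$  The probability that two $\sqrt{c}$-walks from $u$ and $v$ meet at $w$ at the $\ell$-th step, and never meet again afterwards
Table 2: Frequently used notations.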

SLING then pre-computes the hitting probabilities $h^{(\ell)}(u,w)$ and the last-meeting probabilities $\eta(w)$ with bounded error, and materializes them in its index. Given a query node $u$, SLING retrieves all nodes $w$ at all levels $\ell$ with a materialized $h^{(\ell)}(u,w)$. Then, for each level $\ell$ and every such node $w$ on the $\ell$-th level, SLING retrieves $\eta(w)$ and each node $v$ with a materialized $h^{(\ell)}(v,w)$, and estimates $s(u,v)$ using Equation (3).

SLING incurs substantial pre-processing costs for computing $h^{(\ell)}(u,w)$ and $\eta(w)$, which need to be re-computed whenever graph $G$ changes, as there is no clear way to efficiently update them. Consequently, SLING is not suitable for online processing. Further, although SLING achieves attractive asymptotic bounds as shown in Table 1, its practical performance tends to be sub-par due to large hidden constant factors. For instance, Ref. [33] points out that the index size of SLING is over an order of magnitude larger than $G$ itself, which leads to high retrieval costs at query time. Our experiments in Section 5 lead to similar conclusions.

PRSim [33]. PRSim is based on the main concepts of SLING, and further optimizes performance, especially for power-law graphs. PRSim builds a connection between SimRank and personalized PageRank [11]: let $\pi^{(\ell)}(u,w)$ be the $\ell$-hop reverse personalized PageRank (RPPR) between $u$ and $w$; we have $\pi^{(\ell)}(u,w) = (1-\sqrt{c}) \cdot h^{(\ell)}(u,w)$. PRSim uses Equation (4) for SimRank estimation:

$$s(u,v) = \frac{1}{(1-\sqrt{c})^2} \sum_{\ell=1}^{\infty} \sum_{w \in V} \pi^{(\ell)}(u,w) \cdot \pi^{(\ell)}(v,w) \cdot \eta(w) \tag{4}$$

Then, based on the assumption that the input graph is a power-law graph, PRSim selects a number of hub nodes, and pre-computes their RPPR values. At query time, PRSim estimates $\pi^{(\ell)}(u,w)$ and $\eta(w)$ by generating $\sqrt{c}$-walks from $u$ and $w$. If $w$ happens to be a hub, PRSim looks up in the index all possible $\pi^{(\ell)}(v,w)$ for any $v$; otherwise, $\pi^{(\ell)}(v,w)$ is estimated online using a sampling-based technique. Finally, PRSim estimates $s(u,v)$ based on Equation (4).

Similar to SLING, PRSim incurs considerable pre-computation as explained above, and hence, it is not suitable for online SimRank processing. Further, PRSim heavily relies on the power-law graph assumption, both in algorithm design and in its asymptotic complexity analysis. In particular, in the best case that the underlying graph strictly follows a power law, the query time complexity is sublinear in the graph size [33]. However, this assumption is rather strong and might be unrealistic: as reported in a recent study [3], strict power-law graphs are rare in practice.

ProbeSim [21]. The state-of-the-art index-free method is ProbeSim. Specifically, let $W(u)$ and $W(v)$ be two $\sqrt{c}$-walks from nodes $u$ and $v$, respectively, and $f^{(\ell)}(u,v,w)$ be the probability that $W(u)$ and $W(v)$ first meet at $w$ at the $\ell$-th step. ProbeSim employs Equation (5) to estimate SimRank:

$$s(u,v) = \sum_{\ell=1}^{\infty} \sum_{w \in V} f^{(\ell)}(u,v,w) \tag{5}$$

Given query node $u$, ProbeSim first samples a $\sqrt{c}$-walk $W(u)$ from $u$. For every node $w$ at the $\ell$-th step of the walk, ProbeSim performs a probing procedure, in order to compute the first-meeting probabilities at all levels. In particular, ProbeSim probes nodes in the order of increasing steps, so that when probing $w$ at the $\ell$-th step of $W(u)$, the method excludes the nodes visited in previous probings, in order to compute the first-meeting probabilities in Equation (5). This per-step probing is expensive and leads to long query response time, which may put off users who wait online for query results.

Other methods. READS [12] precomputes $\sqrt{c}$-walks and compresses the walks into trees. During query processing, READS retrieves the walks originating from the query node $u$, and finds all the $\sqrt{c}$-walks that meet with the $\sqrt{c}$-walks of $u$. TSF [28] builds an index consisting of one-way graphs by sampling one in-neighbor from each node's incoming edges. During query processing, the one-way graphs are used to simulate random walks to estimate SimRank. According to [33], PRSim subsumes both READS and TSF; further, [33] points out that the result quality guarantee of TSF is questionable, since (i) TSF allows two walks to meet multiple times, leading to overestimated SimRank values, and (ii) TSF assumes that a random walk has no cycles, which may not hold in practice. Finally, TopSim [15] is another index-free method, which is subsumed by ProbeSim according to [21]. Meanwhile, according to [21, 33], the result quality guarantee of TopSim is problematic, as the method truncates random walks with a maximum number of steps.

(a) Source graph $G_u$ and attention sets $A_u^{(\ell)}$: attention nodes are in black.
(b) Reverse-Push from
Figure 1: Running example of SimPush.

3 Overview of SimPush

We overview the proposed SimPush solution in this section, and present the detailed algorithm later in Section 4. As mentioned before, the main idea of SimPush is to identify a small set of attention nodes, and focus computations around these nodes only. As we show soon, the number of attention nodes is bounded by $O(1/\epsilon)$, and they are mostly within the close vicinity of the query node $u$, meaning that they can be efficiently identified. Meanwhile, we prove that the error introduced by neglecting non-attention nodes is negligible and bounded within the error guarantee in Inequality (1). This design significantly reduces the computational overhead in SimPush.

Specifically, given the input graph $G$ and query node $u$, SimPush computes the approximate single source SimRank results for $u$ in three stages. The first stage identifies the set of attention nodes, denoted as $A_u$, through a Source-Push algorithm. Besides $A_u$, Source-Push also returns a graph $G_u$ (referred to as the source graph of $u$) consisting of the nodes in $G$ that are visited during the algorithm. In the second stage, SimPush follows a similar (and yet much improved) framework as SLING, and computes the hitting probabilities between the query node $u$ and each attention node $w \in A_u$, as well as the last-meeting probability of $w$. Note that in SimPush, the computation of hitting probabilities is restricted to attention nodes and heavily reuses the intermediate results obtained in the first stage. This drastically reduces the computational overhead compared to existing methods such as SLING, which precomputes hitting probabilities for all nodes in the graph by following out-going edges. Further, SimPush defines last-meeting probabilities over attention nodes only, and computes them deterministically over the small source graph $G_u$ generated when computing the attention nodes (details in Section 4.1), whereas previous methods such as SLING define last-meeting probabilities over the whole graph and precompute them by sampling numerous $\sqrt{c}$-walks. Finally, in the third stage, SimPush employs a Reverse-Push approach to complete the estimates of the probabilities $\kappa^{(\ell)}(u,v,w)$ between the query node $u$ and every node $v$ via an attention node $w$, yielding the final estimate $\tilde{s}(u,v)$ of the SimRank between $u$ and $v$. In the following, we elaborate on the three stages using the running example in Figure 1.

Discovery of attention nodes. First we clarify what qualifies a node $w$ as an attention node of query node $u$.

Definition (Attention Nodes on Level $\ell$). Given an input graph $G$ and a query node $u$, a node $w$ is an attention node of $u$ on the $\ell$-th level if and only if its hitting probability satisfies $h^{(\ell)}(u,w) \ge \epsilon_h$ for a threshold parameter $\epsilon_h$.

Parameter $\epsilon_h$ is explained in Lemma 4 towards the end of this subsection. Let $A_u^{(\ell)}$ denote the set of attention nodes on level $\ell$, and $A_u$ be the set of all attention nodes that appear in any level. Focusing on the attention nodes only, we employ the interpretation of SimRank in Equation (2), and obtain the approximation $s'(u,v)$ in Equation (6). Lemma 1 provides the error guarantee for $s'(u,v)$.

$$s'(u,v) = \sum_{\ell=1}^{\infty} \sum_{w \in A_u^{(\ell)}} \kappa^{(\ell)}(u,v,w) \tag{6}$$

Lemma 1

Given nodes $u, v$, their exact SimRank $s(u,v)$ and estimated value $s'(u,v)$ in Equation (6) satisfy

$$s(u,v) - \frac{\sqrt{c}}{1-\sqrt{c}}\,\epsilon_h \le s'(u,v) \le s(u,v).$$

In the above definition of $s'(u,v)$, we enumerate all possible levels $\ell$. Next we show that this is not necessary, since attention nodes only exist in the first few levels within close vicinity of query node $u$, according to the following lemma.

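Lemma 2

Given query node $u$, decay factor $c$ and parameter $\epsilon_h$, the number of attention nodes with respect to $u$ is at most $\left\lfloor \frac{\sqrt{c}}{(1-\sqrt{c})\,\epsilon_h} \right\rfloor$. Meanwhile, all attention nodes exist within $L^* = \left\lfloor \log_{1/\sqrt{c}} \frac{1}{\epsilon_h} \right\rfloor$ steps from $u$.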
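A short sketch of why bounds of this kind hold, under our reading of the definition above that an attention node $w$ on level $\ell$ satisfies $h^{(\ell)}(u,w) \ge \epsilon_h$, and writing $A_u^{(\ell)}$ for the level-$\ell$ attention set and $A_u$ for their union. The only other ingredient is that a $\sqrt{c}$-walk survives $\ell$ steps with probability at most $(\sqrt{c})^{\ell}$; the constants below follow from this argument and are not taken from the lemma's proof.

```latex
\begin{align*}
\sum_{w \in V} h^{(\ell)}(u,w) \le (\sqrt{c})^{\ell}
  &\;\Longrightarrow\; |A_u^{(\ell)}| \le \frac{(\sqrt{c})^{\ell}}{\epsilon_h}, \\
|A_u| \le \sum_{\ell \ge 1} |A_u^{(\ell)}|
  &\le \frac{1}{\epsilon_h} \sum_{\ell \ge 1} (\sqrt{c})^{\ell}
   = \frac{\sqrt{c}}{(1-\sqrt{c})\,\epsilon_h}, \\
A_u^{(\ell)} \neq \emptyset
  &\;\Longrightarrow\; (\sqrt{c})^{\ell} \ge \epsilon_h
  \;\Longrightarrow\; \ell \le \log_{1/\sqrt{c}} \frac{1}{\epsilon_h}.
\end{align*}
```

The first line is a counting argument: the total probability mass at level $\ell$ is at most $(\sqrt{c})^{\ell}$, and each attention node claims at least $\epsilon_h$ of it. The last line shows that levels beyond a logarithmic depth cannot hold any attention node at all.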

According to Lemma 2, to discover all attention nodes, it suffices to explore $L^*$ steps around the query node $u$, where $L^*$ denotes the step bound in Lemma 2. Further, in SimPush, attention node discovery is performed by exploring $L \le L^*$ steps from $u$, through the proposed Source-Push algorithm, detailed in Section 4.1. In particular, Source-Push samples a sufficient number of random walks to determine $L$, such that with high probability (according to parameter $\delta$), all attention nodes exist within $L$ steps from $u$. The specific value of $L$ depends on the input graph $G$. As our experiments demonstrate, $L$ is usually small for real graphs. For instance, when , on a billion-edge Twitter graph, the average $L$ is merely , and the number of attention nodes is no more than a few hundred.

Next, to identify attention nodes, SimPush also needs to compute the hitting probabilities from $u$. This is done through a residue propagation procedure in the Source-Push algorithm, detailed in Section 4.1. Specifically, $h^{(0)}(u,u)$ is set to 1, and all other hitting probabilities are initialized to zero. Starting from the $0$-th level, Source-Push pushes hitting probabilities of nodes from the current level to their in-neighbors on the next level, until reaching the $L$-th level. As mentioned earlier, SimPush also records the nodes and edges traversed during the propagation in a source graph $G_u$. Specifically, $G_u$ is organized by levels (with max level $L$), and there are only edges between adjacent levels, i.e., incoming edges from the $(\ell+1)$-th level to the $\ell$-th level. $G_u$ itself, as well as the computed hitting probabilities of attention nodes, are reused in subsequent stages of SimPush.

Figure 1(a) shows an example of the propagation process, assuming and . Attention (resp. non-attention) nodes are shown as solid circles (resp. empty circles) in the figure. Symbols with a superscript circle (e.g., ) denote non-attention nodes, which are used later in Section 4. Specifically, the propagation starts from $u$ and traverses the graph in a level-wise manner, reaching nodes on the first level, nodes on the second level, and nodes on the third level, which is the last level since $L = 3$. Note that a node can be visited multiple times on different levels, e.g., on both the first and third levels. In this case, it is also possible that a node is an attention node on one level (e.g., on Level 1) and a non-attention node on another (e.g., on Level 3).

Estimation of $\kappa^{(\ell)}(u,v,w)$. After identifying attention nodes, SimPush needs to estimate each $\kappa^{(\ell)}(u,v,w)$, according to Equation (6). Existing solutions mostly estimate it by running numerous $\sqrt{c}$-walks on the whole graph $G$, which is costly. Instead, SimPush incorporates a novel algorithm that mostly operates within the source graph $G_u$ obtained in the first phase, which is far smaller than $G$.

Specifically, the hitting probabilities from $u$ to all attention nodes are already obtained in Phase 1. Next, we focus on the last-meeting probability for a given node $w$. In order to achieve high efficiency, SimPush only computes last-meeting probabilities for attention nodes, and limits the computations within the source graph $G_u$. Towards this end, SimPush defines a new last-meeting probability, as follows.

Definition (Last-Meeting Probability in $G_u$). Given attention node $w$ on the $\ell$-th level of $G_u$, where $1 \le \ell \le L$, the last-meeting probability of $w$ within $G_u$, $\gamma^{(\ell)}(w)$, is the probability that two $\sqrt{c}$-walks from $w$ and walking within $G_u$ do not meet at any attention node on the $(\ell+i)$-th level within $G_u$, for $1 \le i \le L-\ell$.

We emphasize that $\gamma^{(\ell)}(w)$ has vital differences compared to the last-meeting probability $\eta(w)$ used in SLING and PRSim, explained in Section 2.2. First, $\gamma^{(\ell)}(w)$ is defined based on the attention sets and source graph $G_u$, instead of the whole graph. Second, $\gamma^{(\ell)}(w)$ does not take into account whether or not two walks meet at any non-attention node; the rationale here is that non-attention nodes have negligible impact on the SimRank estimation of $u$, and, thus, can be safely ignored. Third, $\gamma^{(\ell)}(w)$ is level-specific and we only consider $L-\ell$ steps in $G_u$, since there are only incoming edges between consecutive levels in $G_u$ and the levels are bounded by $L$. In Section 4.2, we present an efficient residue-push technique to compute the $\gamma^{(\ell)}(w)$ of all attention nodes, without performing any $\sqrt{c}$-walk.

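Based on the above notion of last-meeting probability, we design another estimate $s^{+}(u,v)$ for the SimRank value between the query node $u$ and a node $v$, as follows.

$$s^{+}(u,v) = \sum_{\ell=1}^{L^*} \sum_{w \in A_u^{(\ell)}} h^{(\ell)}(u,w) \cdot \gamma^{(\ell)}(w) \cdot h^{(\ell)}(v,w) \tag{7}$$

where $L^*$ is the level bound of Lemma 2 and $A_u^{(\ell)}$ is the set of attention nodes at the $\ell$-th level of $G_u$, obtained in the first phase. Note that here the trivial case of $u = v$ is not considered, and we require $u \ne v$.

Compared to $s'(u,v)$ defined in Equation (6), $s^{+}(u,v)$ uses an estimated $\kappa^{(\ell)}(u,v,w)$, computed using hitting probabilities and last-meeting probabilities in $G_u$. The following lemma establishes the approximation bound for $s^{+}(u,v)$.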

Lemma 3

Given distinct nodes $u$ and $v$, their exact SimRank value $s(u,v)$ and estimate $s^{+}(u,v)$ satisfy

Reverse-Push. In Equation (7), it remains to clarify the computation of $h^{(\ell)}(v,w)$. Instead of estimating $h^{(\ell)}(v,w)$ independently (e.g., by simulating random walks), we propose a novel Reverse-Push algorithm, detailed in Section 4.3, which estimates the product $h^{(\ell)}(u,w) \cdot \gamma^{(\ell)}(w) \cdot h^{(\ell)}(v,w)$ as a whole through residue push. Specifically, SimPush regards $h^{(\ell)}(u,w) \cdot \gamma^{(\ell)}(w)$ as the initial residue of attention node $w$, and keeps pushing the residue to each node $v$, following out-going edges, until $\ell$ steps are performed.

For example, in Figure 1(b), given a 3rd-level attention node with residue , Reverse-Push propagates the residue to the out-neighbors of , i.e., and , to obtain the residues at the 2nd level, i.e., and . Then, all 2nd-level residues are pushed to their out-neighbors to get all 1st-level residues. After that, all 1st-level residues are pushed to get the 0-th level residues. It is clear that the nodes at the 0-th level, e.g., (as well as and ) meet with $u$ at in 3 steps. The residue estimates w.r.t. . The detailed push criteria are in Section 4.3.
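The residue propagation described above can be sketched as follows. This is a simplified illustration rather than the Reverse-Push algorithm of Section 4.3: it applies no push criteria or pruning, and the function name `reverse_push` and the inputs `out_nbrs` (node to list of out-neighbors) and `in_deg` (node to in-degree) are hypothetical. The factor $\sqrt{c}/d_I(y)$ per push follows from the $\sqrt{c}$-walk definition: a walk at $y$ moves to a specific in-neighbor with that probability, so reversing the step along an out-going edge multiplies the residue by the same factor.

```python
import math
from collections import defaultdict


def reverse_push(out_nbrs, in_deg, w, level, residue, c):
    """Push `residue` from node w along out-going edges for `level` steps.

    After the loop, result[v] == residue * h(v, w), where h(v, w) is the
    probability that a sqrt(c)-walk from v reaches w at step `level`.
    """
    sqrt_c = math.sqrt(c)
    cur = {w: residue}
    for _ in range(level):
        nxt = defaultdict(float)
        for x, r in cur.items():
            for y in out_nbrs.get(x, ()):
                # A walk at y jumps to in-neighbor x with prob sqrt(c) / d_I(y).
                nxt[y] += sqrt_c * r / in_deg[y]
        cur = dict(nxt)
    return cur


# Edges x -> y mean x is an in-neighbor of y: c->a, c->b, d->b, a->u, b->u.
out_nbrs = {"a": ["u"], "b": ["u"], "c": ["a", "b"], "d": ["b"]}
in_deg = {"u": 2, "a": 1, "b": 2}
result = reverse_push(out_nbrs, in_deg, "c", level=2, residue=1.0, c=0.64)
```

With $c = 0.64$ (so $\sqrt{c} = 0.8$) and unit residue, the value that arrives at `u` is the 2-step hitting probability from `u` to `c`: $0.4 \cdot 0.8 + 0.4 \cdot 0.4 = 0.48$.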

Stage Time Complexity
Source-Push
All $\gamma^{(\ell)}(w)$ computation
Reverse-Push
Table 3: Complexity of different stages in SimPush.

Accordingly, our final SimRank estimate $\tilde{s}(u,v)$ is

$$\tilde{s}(u,v) = \sum_{\ell=1}^{L^*} \sum_{w \in A_u^{(\ell)}} h^{(\ell)}(u,w) \cdot \gamma^{(\ell)}(w) \cdot \hat{h}^{(\ell)}(v,w) \tag{8}$$

where $u$ and $v$ are distinct nodes in $G$, and $A_u^{(\ell)}$ is the $\ell$-th level attention set. Here, the hitting probability from $v$ to $w$ is hatted to signify that Reverse-Push introduces additional estimation error. Note that as described above, the estimation is over the entire product rather than the last term. Lemma 4 provides the error guarantee for $\tilde{s}(u,v)$, and explains the value of $\epsilon_h$.

Lemma 4

Given distinct nodes $u$ and $v$ in $G$, error parameter $\epsilon$, and decay factor $c$, when , we have $|\tilde{s}(u,v) - s(u,v)| \le \epsilon$.

Note that in Lemma 4, the error bound is deterministic, rather than probabilistic as in our problem definition in Inequality (1). This is due to the fact that in Equation (8), we enumerate up to $L^*$ levels instead of $L$ levels as in the actual algorithm, as mentioned earlier. The value of $L$, as well as the probabilistic error bound of the complete solution, are deferred to the next section. Finally, Table 3 lists the time complexity of the three stages of SimPush.

4 Detailed SimPush Algorithm

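Input: Graph $G$, query node $u$, decay factor $c$, error parameter $\epsilon$, failure probability $\delta$
Output: $\tilde{s}(u,v)$ for each $v \in V$, w.r.t. query node $u$.
1 Set $\epsilon_h$ according to Lemma 4;
2 Invoke Algorithm 2 (Source-Push) to obtain attention nodes $A_u$ and the source graph $G_u$;
3 Invoke Algorithm 3 to compute all nonzero hitting probabilities for attention nodes in $G_u$;
4 for $\ell = 1$ to $L$ do
5        for each attention node $w$ in $A_u^{(\ell)}$ do
6               Compute $\gamma^{(\ell)}(w)$ with Algorithm 4;
7               $r^{(\ell)}(w) \leftarrow h^{(\ell)}(u,w) \cdot \gamma^{(\ell)}(w)$;
8 Invoke Algorithm 5 (Reverse-Push) to get $\tilde{s}(u,v)$ for each $v \in V$;
9 return $\tilde{s}(u,v)$ for each $v \in V$;
Algorithm 1: SimPush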

Algorithm 1 shows the main SimPush algorithm, consisting of three stages. With $\epsilon_h$ set at Line 1, SimPush first invokes Source-Push (Section 4.1) to obtain the attention nodes $A_u$ and source graph $G_u$ of $u$ (Line 2). Then (Lines 3-7), it computes the $\gamma^{(\ell)}(w)$ of all attention nodes (Section 4.2), and finally invokes Reverse-Push (Section 4.3) to compute the single source SimRank values at Line 8.

4.1 Source-Push

Source-Push first samples a sufficient number of random walks to detect the max level $L$ from query node $u$, such that with high probability, all attention nodes appear within $L$ steps. Then, it performs residue propagation to compute the hitting probabilities from $u$, in order to identify attention nodes of $u$ and generate source graph $G_u$. Algorithm 2 displays Source-Push. At Lines 1-3, Source-Push first samples $\sqrt{c}$-walks from $u$ and counts the visits of every node at every $\ell$-th step. It then identifies the max level $L$ at which some node's visit count passes the detection threshold, where $L$ is bounded by $L^*$ (Lines 4-8).

Then, Algorithm 2 computes the hitting probabilities from $u$ for at most $L$ levels by propagation (Lines 9-19). Initially, at Lines 9-10, $h^{(0)}(u,u)$ is set to 1, and all other hitting probabilities are initialized to zero. Starting from the $0$-th level, Source-Push inserts $u$ into frontier set $F$ at Line 11, and then for each node $v$ in $F$ at the current $\ell$-th level, it pushes and increases the $(\ell+1)$-level hitting probability of every in-neighbor $v'$ of $v$ by $\frac{\sqrt{c}}{d_I(v)} \cdot h^{(\ell)}(u,v)$, and adds the edge from $v'$ to $v$ to $G_u$ (Lines 12-16). Then, Source-Push moves to the $(\ell+1)$-th level, and finds all the nodes to push (Lines 17-19). The whole process continues until the $L$-th level is reached or $F$ is empty (Line 12). At Lines 20-21, all attention nodes are identified. Lemma 5 states the accuracy guarantees and time complexity of Algorithm 2.

Lemma 5

Algorithm 2 runs in expected time, and with probability at least $1-\delta$, $G_u$ contains all nodes $w$ with $h^{(\ell)}(u,w) \ge \epsilon_h$ for all levels.

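Input: Graph $G$, query $u$, decay factor $c$, parameter $\epsilon_h$
Output: Source graph $G_u$ and attention node sets $A_u^{(\ell)}$ for $\ell = 1, \ldots, L$.
1 Initialize the visit count of every node $w$ at every level $\ell$ to 0;
2 for each of the sampled walks do
3        Generate a $\sqrt{c}$-walk from $u$ and, for every visited node $w$ at the $\ell$-th step, increment the visit count of $w$ at level $\ell$;
4 $L \leftarrow 0$;
5 for every nonzero visit count of a node $w$ at level $\ell$ do
6        if the count passes the detection threshold and $\ell > L$ then
7               $L \leftarrow \ell$;
8 $L \leftarrow \min(L, L^*)$;
9 $h^{(\ell)}(u,w) \leftarrow 0$ for $\ell = 0, \ldots, L$ and each $w \in V$;
10 $h^{(0)}(u,u) \leftarrow 1$; $\ell \leftarrow 0$;
11 Frontier set $F \leftarrow \{u\}$;
12 while $F \ne \emptyset$ and $\ell < L$ do
13        for each $v \in F$ do
14               for each node $v' \in I(v)$ do
15                      $h^{(\ell+1)}(u,v') \leftarrow h^{(\ell+1)}(u,v') + \frac{\sqrt{c}}{d_I(v)} \cdot h^{(\ell)}(u,v)$;
16                      Insert $v$ to the $\ell$-th level and $v'$ to the $(\ell+1)$-th level of $G_u$, and add edge from $v'$ to $v$ in $G_u$;
17        $\ell \leftarrow \ell + 1$; $F \leftarrow \emptyset$;
18        for each node $v$ that qualifies for pushing at level $\ell$ do
19               $F \leftarrow F \cup \{v\}$;
20 for $\ell = 1, \ldots, L$ do
21        Insert each $w$ in $G_u$ with $h^{(\ell)}(u,w) \ge \epsilon_h$ into $A_u^{(\ell)}$;
Algorithm 2: Source-Push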
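The propagation part of Source-Push can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the full algorithm: the max level is passed in directly as `max_level` instead of being detected from sampled walks, every node with nonzero probability is pushed, and attention nodes are selected with the threshold test `p >= eps_h`, which is our reading of the definition. The function name `source_push` and the adjacency format `in_nbrs` are hypothetical.

```python
import math
from collections import defaultdict


def source_push(in_nbrs, u, c, eps_h, max_level):
    """Level-wise push of hitting probabilities from u along in-edges.

    Returns (h, edges, attention):
      h[l][w]      hitting probability from u to w at step l
      edges        set of (level, node, in_neighbor) triples of the source graph
      attention[l] nodes on level l whose hitting probability is >= eps_h
    """
    sqrt_c = math.sqrt(c)
    h = [defaultdict(float) for _ in range(max_level + 1)]
    h[0][u] = 1.0
    edges = set()
    for lvl in range(max_level):
        for v, p in h[lvl].items():
            nbrs = in_nbrs[v]
            for x in nbrs:
                # A sqrt(c)-walk at v moves to in-neighbor x w.p. sqrt(c) / d_I(v).
                h[lvl + 1][x] += sqrt_c * p / len(nbrs)
                edges.add((lvl, v, x))
    attention = {
        lvl: {w for w, p in h[lvl].items() if p >= eps_h}
        for lvl in range(1, max_level + 1)
    }
    return h, edges, attention


graph = {"u": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": [], "d": []}
h, edges, attention = source_push(graph, "u", c=0.64, eps_h=0.2, max_level=2)
```

With $c = 0.64$ (so $\sqrt{c} = 0.8$), both in-neighbors of `u` receive 0.4 on level 1. On level 2, `c` collects $0.32 + 0.16 = 0.48$ and qualifies as an attention node, while `d` receives only 0.16 and does not.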

Lastly, we define the hitting probability within $G_u$, which is an important concept used in the next stages of SimPush.

Definition (Hitting probability in $G_u$). Given nodes $w$ and $w_i$ in $G_u$, the hitting probability from $w$ to $w_i$ at the $i$-th step in $G_u$, $\bar{h}^{(i)}(w,w_i)$, is the probability that a $\sqrt{c}$-walk from $w$ and walking in $G_u$ visits $w_i$ at the $i$-th step.

Hereafter, we use $\bar{h}$ to denote the hitting probabilities in $G_u$, and use $h$ to represent the hitting probabilities in $G$. For query node $u$, every $h^{(\ell)}(u,w)$ computed by Source-Push over $G$ can be reproduced by pushing over $G_u$, i.e., $\bar{h}^{(\ell)}(u,w)$ is the same as $h^{(\ell)}(u,w)$. For ease of presentation, in the following sections, we denote $w$, $w_i$, and $w_j$ as nodes at the $\ell$-th, $(\ell+i)$-th, and $(\ell+j)$-th levels of $G_u$ respectively, which are attention nodes by default, unless otherwise specified.

4.2 Last-Meeting Correction within $G_u$

As mentioned, given query $u$ with attention sets $A_u^{(\ell)}$, SimPush computes the last-meeting probability $\gamma^{(\ell)}(w)$ for each $w \in A_u$ in the source graph $G_u$ (Definition 3). Utilizing $G_u$, we design a method that computes $\gamma^{(\ell)}(w)$ for all attention nodes in $A_u$ without generating any $\sqrt{c}$-walks, in time. We first clarify the formula to compute $\gamma^{(\ell)}(w)$, and then present the detailed algorithms.

Formula to compute $\gamma^{(\ell)}(w)$. Given attention nodes $w$ and $w_i$, we define the $i$-step first-meeting probability $\rho^{(i)}(w,w_i)$ in $G_u$ as follows.

Definition (First-meeting probability in $G_u$). Given attention nodes $w$ and $w_i$ at the $\ell$-th and $(\ell+i)$-th levels of $G_u$ respectively, where $\ell \ge 1$ and $i \ge 1$, $\rho^{(i)}(w,w_i)$ is the probability that two $\sqrt{c}$-walks from $w$ walking in $G_u$ first meet at attention node $w_i$.

Note that in Definition 4.2, it is allowed that the two walks first meet at some non-attention node in $G_u$, before meeting at $w_i$. In this section, when we say that two walks first meet, it means that the two walks first meet at an attention node in $G_u$. According to Definitions 3 and 4.2, we have

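$$\gamma^{(\ell)}(w) = 1 - \sum_{i=1}^{L-\ell} \sum_{w_i \in A_u^{(\ell+i)}} \rho^{(i)}(w,w_i) \tag{9}$$

where $A_u^{(\ell+i)}$ is the $(\ell+i)$-th level attention set and $1 \le i \le L-\ell$. Now, the problem reduces to computing $\rho^{(i)}(w,w_i)$ in $G_u$. This requires the hitting probabilities between attention nodes within $G_u$ (Definition 4.1), to be clarified soon.

When $i = 1$, $\rho^{(1)}(w,w_1)$ is nonzero only if attention node $w_1$ is an in-neighbor of $w$ in $G_u$ (obviously $w_1$ is at the $(\ell+1)$-th level of $G_u$). Given the 1-step hitting probability $\bar{h}^{(1)}(w,w_1)$, the probability of two independent $\sqrt{c}$-walks from $w$ walking in $G_u$ and meeting at $w_1$ is $\bar{h}^{(1)}(w,w_1)^2$. Further, since there is only one step from $w$ to $w_1$, $\rho^{(1)}(w,w_1)$ is exactly $\bar{h}^{(1)}(w,w_1)^2$, i.e.,

$$\rho^{(1)}(w,w_1) = \bar{h}^{(1)}(w,w_1)^2, \quad \forall\, w_1 \in A_u^{(\ell+1)} \tag{10}$$

where $A_u^{(\ell+1)}$ is the $(\ell+1)$-th level attention set. For example, in Figure 2, .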

Figure 2: Hitting probabilities in a subgraph of $G_u$ in Figure 1(a).
Figure 3: Non-first-meeting probability from attention node $w$ to $w_i$ via $w_j$.

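When $i > 1$, we compute $\rho^{(i)}(w,w_i)$ by utilizing the $\rho^{(j)}(w,w_j)$ of the attention nodes $w_j$ between $w$ and $w_i$ in $G_u$, where $1 \le j < i$. Suppose that two $\sqrt{c}$-walks from $w$ walking in $G_u$ first meet at $w_j$ and then meet at $w_i$. The non-first-meeting probability from $w$ to $w_i$ via $w_j$ is $\rho^{(j)}(w,w_j) \cdot \bar{h}^{(i-j)}(w_j,w_i)^2$. Figure 3 illustrates this concept, where the first-meeting probability is represented by two dashed lines, and the meeting probability is represented by one dashed line. Therefore, $\rho^{(i)}(w,w_i)$ equals the meeting probability from $w$ to $w_i$, i.e., $\bar{h}^{(i)}(w,w_i)^2$, minus all the non-first-meeting probabilities from $w$ to $w_i$ via any attention node $w_j$ between $w$ and $w_i$, i.e.,

$$\rho^{(i)}(w,w_i) = \bar{h}^{(i)}(w,w_i)^2 - \sum_{j=1}^{i-1} \sum_{w_j \in A_u^{(\ell+j)}} \rho^{(j)}(w,w_j) \cdot \bar{h}^{(i-j)}(w_j,w_i)^2 \tag{11}$$

where $1 \le j < i$. For example, in Figure 2, is not considered since it is a non-attention node.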
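The recursion described above (squared hitting probability minus the contributions of earlier first meetings, then one minus the sum of all first-meeting probabilities) can be sketched as below. This is an illustrative sketch, not the paper's Algorithm 4: the function name `last_meeting_prob`, the `(level, node)` keys, and the dictionary `hbar` of hitting probabilities within the source graph are all assumptions, and `hbar` is taken as given input rather than computed.

```python
def last_meeting_prob(w, level, attention, hbar):
    """Last-meeting probability of attention node w on `level`.

    attention: dict level -> set of attention nodes on that level
    hbar[((la, a), (lb, b))]: probability that a walk from a (level la),
        staying in the source graph, is at b (level lb) after lb - la steps.
    Returns (gamma, rho), where rho maps (level, node) to the
    first-meeting probability from w.
    """
    src = (level, w)
    rho = {}
    for lvl in range(level + 1, max(attention) + 1):
        for x in attention.get(lvl, ()):
            dst = (lvl, x)
            # Probability that both walks are at x at this level ...
            meet = hbar.get((src, dst), 0.0) ** 2
            # ... minus the cases where they already first met at an
            # attention node on a strictly earlier level.
            for mid, rho_mid in rho.items():
                if mid[0] < lvl:
                    meet -= rho_mid * hbar.get((mid, dst), 0.0) ** 2
            rho[dst] = meet
    return 1.0 - sum(rho.values()), rho


# w (level 1) has in-neighbors x (attention) and y (non-attention);
# x's in-neighbor is z; y's in-neighbors are z and q; sqrt(c) = 0.8.
attention = {1: {"w"}, 2: {"x"}, 3: {"z"}}
hbar = {
    ((1, "w"), (2, "x")): 0.4,
    ((2, "x"), (3, "z")): 0.8,
    ((1, "w"), (3, "z")): 0.4 * 0.8 + 0.4 * 0.4,  # via x and via y
}
gamma, rho = last_meeting_prob("w", 1, attention, hbar)
```

In this example, the walks first meet at `x` with probability $0.4^2 = 0.16$ and at `z` with probability $0.48^2 - 0.16 \cdot 0.8^2 = 0.128$, so the last-meeting probability of `w` is $1 - 0.16 - 0.128 = 0.712$. The meeting at the non-attention node `y` is deliberately not subtracted, in line with the remark that meetings at non-attention nodes are ignored.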

Hitting probabilities between attention nodes in . Now we focus on computing hitting probabilities in . Given nodes and (here can be a non-attention node), is computed by aggregating the hitting probabilities