SLING: A Near-Optimal Index Structure for SimRank
SimRank is a similarity measure for graph nodes that has numerous applications in practice. Scalable SimRank computation has been the subject of extensive research for more than a decade, and yet, none of the existing solutions can efficiently derive SimRank scores on large graphs with provable accuracy guarantees. In particular, the state-of-the-art solution requires up to a few seconds to compute a SimRank score in million-node graphs, and does not offer any worst-case assurance in terms of the query error.
This paper presents SLING, an efficient index structure for SimRank computation. SLING guarantees that each SimRank score returned has at most additive error, and it answers any single-pair and single-source SimRank queries in and time, respectively. These time complexities are near-optimal, and are significantly better than the asymptotic bounds of the most recent approach. Furthermore, SLING requires only space (which is also near-optimal in an asymptotic sense) and pre-computation time, where is the failure probability of the preprocessing algorithm. We experimentally evaluate SLING with a variety of real-world graphs with up to several millions of nodes. Our results demonstrate that SLING is up to times (resp. times) faster than competing methods for single-pair (resp. single-source) SimRank queries, at the cost of higher space overheads.
SLING: A Near-Optimal Index Structure for SimRank
|Shanghai Jiao Tong University|
|Nanyang Technological University|
Assessing the similarity of nodes based on graph topology is an important problem with numerous applications, including social network analysis [?], web mining [?], collaborative filtering [?], natural language processing [?], and spam detection [?]. A number of similarity measures have been proposed, among which SimRank [?] is one of the most well-adopted. The formulation of SimRank is based on two intuitive arguments:
A node should have the maximum similarity to itself;
The similarity between two different nodes can be measured by the average similarity between the two nodes’ neighbors.
Formally, the SimRank score of two nodes and is defined as:
where denotes the set of in-neighbors of a node , and is a decay factor typically set to or [?, ?]. Previous work [?, ?, ?, ?, ?, ?, ?, ?, ?] has applied SimRank (and its variants) to various problem domains, and has demonstrated that it often provides high-quality measurements of node similarity.
Despite of the effectiveness of SimRank, computing SimRank scores efficiently on large graphs is a challenging task, and has been the subject of extensive research for more than a decade. In particular, Jeh and Widom [?] propose the first SimRank algorithm, which returns the SimRank scores of all pairs of nodes in the input graph . The algorithm incurs prohibitive costs: it requires space and time, where and denote the numbers of nodes and edges in , respectively, and is the maximum additive error allowed in any SimRank score. Subsequently, Lizorkin et al. [?] improve the time complexity of the algorithm to , which is further improved to by Yu et al. [?], where . However, the space complexity of the algorithm remains , as is inherent in any algorithm that computes all-pair SimRank scores.
Fogaras and Rácz [?] present the first study on single-pair SimRank computation, and propose a Monte-Carlo method that requires pre-computation time and space. The method returns the SimRank score of any node pair in time, where is the failure probability of the Monte-Carlo method. Subsequently, Li et al. [?] propose a deterministic algorithm for single-pair SimRank queries; it has the same time complexity with Jeh and Widom’s solution [?], but provides much better practical efficiency. However, existing work [?] show that neither Li et al.’s [?] nor Fogaras and Rácz’s solution [?] is able to handle million-node graphs in reasonable time and space. There is a line of research [?, ?, ?, ?, ?, ?] that attempts to mitigate this efficiency issue based on an alternative formulation of SimRank, but the formulation is shown to be incorrect [?], in that it does not return the same SimRank scores as defined in Equation (SLING: A Near-Optimal Index Structure for SimRank).
|Algorithm||Query Time||Space Overhead||Preprocessing Time|
|Single Pair||Single Source|
|Fogaras and Rácz [?]|
|no formal result|
|this paper||(Algorithm 3)|
The most recent approach to SimRank computation is the linearization technique [?] by Maehara et al., which is shown to considerably outperform existing solutions in terms of efficiency and scalability. Nevertheless, it still requires up to a few seconds to answer a single-pair SimRank query on sizable graphs, which is inadequate for large-scale applications. More importantly, the technique is unable to provide any worst-case guarantee in terms of query accuracy. In particular, the technique has a preprocessing step that requires solving a system of linear equations; assuming that the solution to is exact, Maehara et al. [?] show that the technique can ensure worst-case query error, and can answer any single-pair and single-source SimRank queries in and time, respectively. (A single-source SimRank query from a node asks for the SimRank score between and every other node.) Unfortunately, as we discuss in Section SLING: A Near-Optimal Index Structure for SimRank, the linearization technique cannot precisely solve , nor can it offer non-trivial guarantees in terms of the query errors incurred by the imprecision of ’s solution. Consequently, the technique in [?] only provides heuristic solutions to SimRank computation. In summary, after more than tens years of research on SimRank, there is still no solution for efficient SimRank computation on large graphs with provable accuracy guarantees.
This paper presents SLING (SimRank via Local Updates and Sampling), an efficient index structure for SimRank computation. SLING guarantees that each SimRank score returned has at most additive error, and answers any single-pair and single-source SimRank queries in and time, respectively. These time complexities are near-optimal, since any SimRank method requires (resp. ) time to output the result of any single-pair (resp. single-source) query. In addition, they are significantly better than the asymptotic bounds of the states of the art (including Maehara et al.’s technique [?] under their heuristic assumptions), as we show in Table SLING: A Near-Optimal Index Structure for SimRank. Furthermore, SLING requires only space (which is also near-optimal in an asymptotic sense) and pre-computation time, where is the failure probability of the preprocessing algorithm.
Apart from its superior asymptotic bounds, SLING also incorporates several optimization techniques to enhance its practical performance. In particular, we show that its preprocessing algorithm can be improved with a technique that estimates the expectation of a Bernoulli variable using an asymptotically optimal number of samples. Additionally, its space consumption can be heuristically reduced without affecting its theoretical guarantees, while its empirical efficiency for single-source SimRank queries can be considerably improved, at the cost of a slight increase in its query time complexity. Last but not least, its construction algorithms can be easily parallelized, and it can efficiently process queries even when its index structure does not fit in the main memory.
We experimentally evaluate SLING with a variety of real-world graphs with up to several millions of nodes, and show that it significantly outperforms the the states of the art in terms of query efficiency. Specifically, SLING requires at most milliseconds to process a single-pair SimRank query on our datasets, and is up to times faster than the linearization method [?]. To our knowledge, this is the first result in the literature that demonstrates millisecond-scale query time for single-pair SimRank computation on million-node graphs. For single-source SimRank queries, SLING is up to times more efficient than the linearization method. As a tradeoff, SLING incurs larger space overheads than the linearization method, but it is a still much more favorable choice in the common scenario where query time and accuracy (instead of space consumption) are the main concern.
The remainder of the paper is organized as follows. Section SLING: A Near-Optimal Index Structure for SimRank defines the problem that we study. Section \thetable discusses the major existing methods for SimRank computation. Section SLING: A Near-Optimal Index Structure for SimRank presents the SLING index, with a focus on single-pair queries. Section SLING: A Near-Optimal Index Structure for SimRank proposes techniques to optimize the practical performance of SLING. Section SLING: A Near-Optimal Index Structure for SimRank details how SLING supports single-source queries. Section \thefigure experimentally evaluates SLING against the stats of the art
Let be a directed and unweighted graph with nodes and edges. We aim to construct an index structure on to support single-pair and single-source SimRank queries, which are defined as follows:
A single-pair SimRank query takes as input two nodes and in , and returns their SimRank score (see Equation SLING: A Near-Optimal Index Structure for SimRank).
A single-source SimRank query takes as input a node , and returns for each node in .
Following previous work [?, ?, ?, ?], we allow an additive error of at most in each SimRank score returned for any SimRank query.
For ease of exposition, we focus on single-pair SimRank queries in Sections \thetable-SLING: A Near-Optimal Index Structure for SimRank, and then discuss single-source queries in Section SLING: A Near-Optimal Index Structure for SimRank. Table SLING: A Near-Optimal Index Structure for SimRank shows the notations frequently used in the paper. Unless otherwise specified, all logarithms in this paper are to base .
|the input graph|
|the numbers of nodes and edges in|
|the -th node in|
|the set of in-neighbors of a node in|
|the SimRank score of two nodes and in|
|the decay factor in the definition of SimRank|
|the maximum additive error allowed in a SimRank score|
|the failure probability of a Monte-Carlo algorithm|
|the entry on the -th row and -th column of a matrix|
|the correction factor for node|
|the hitting probability (HP) from node to node at step (see Section SLING: A Near-Optimal Index Structure for SimRank)|
This section revisits the three major approaches to SimRank computation: the power method [?], the Monte Carlo method [?], and the linearization method [?, ?, ?, ?]. The asymptotic performance of the Monte Carlo method and the linearization method has been studied in literature, but to our knowledge, there is no formal analysis regarding their space and time complexities when ensuring worst-case errors. We remedy this issue with detailed discussions on each method’s asymptotic bounds and limitations.
The power method [?] is an iterative method for computing the SimRank scores of all pairs of nodes in an input graph. The method uses a matrix , where the element on the -th row and -th column () denotes the SimRank score of the -th node and -th node . Initially, the method sets
After that, in the -th () iteration, the method updates based on the following equation:
Let denote the version of right after the -th iteration. Lizorkin et al. [?] establish the following connection between and the errors in the SimRank scores in :
Lemma 1 ([?])
If , then for any , we have .
Based on Lemma 1 and the fact that each iteration of the power method takes time, we conclude that the power method runs in time when ensuring worst-case error. In addition, it requires space (caused by ). These large complexities in time and space make the power method only applicable on small graphs.
The Monte Carlo method [?] is motivated by an alternative definition of SimRank scores [?] that utilizes the concept of reverse random walks. Given a node in , a reverse random walk from is a sequence of nodes , such that () is selected uniformly at random from the in-neighbors of . We refer to as the -th step of .
Suppose that we have two reverse random walks and that start from two nodes and , respectively, and they first meet at the -th step. That is, the -th steps of and are identical, but for any , the -th step of differs from the -th step of . Jeh and Widom [?] establishes the following connection between and the SimRank score of and :
where denotes the expectation of a random variable.
Based on Equation (2), the Monte Carlo method [?] pre-computes a set of reverse random walks from each node in , such that (i) each set has the same number of walks, and (ii) each walk in is truncated at step , i.e., the nodes after the -th step are omitted. (This truncation is necessary to ensure that the walk is computed efficiently.) Then, given two nodes and , the method estimates their SimRank score as
where denotes the step at which the -th walk in first meets with the -th walk in . Fogaras and Rácz [?] show that, with at least probability,
However, we note that , due to the truncation imposed on the reverse random walks in and . To address this issue, we present the following inequality:
holds for all pairs of and with at least probability. In that case, the space and preprocessing time complexities of the Monte Carlo method are both . In addition, the method takes time to answer a single-pair SimRank query, and time to process a single-source SimRank query. These space and time complexities are rather unfavorable under typical settings of in practice (e.g., ). Fogaras and Rácz [?] alleviate this issue with a coupling technique, which improves the practical performance of the Monte Carlo method in terms of pre-computation time and space consumption. Nevertheless, the method still incurs significant overheads, due to which it is unable to handle graphs with over one million nodes, as we show in Section \thefigure.
Let and be two matrices, with and
Yu et al. [?] show that Equation (SLING: A Near-Optimal Index Structure for SimRank) (i.e., the definition of SimRank) can be rewritten as
where is an identity matrix, is the transpose of , and is the element-wise maximum operator, i.e., for any two matrices and and any .
Maehara et al. [?] point out that solving Equation (6) is difficult since it is a non-linear problem due to the operator. To circumvent this difficulty, they prove that there exists a diagonal matrix (referred to as the diagonal correction matrix), such that
Furthermore, once is given, one can uniquely derive based on the following lemma by Maehara et al. [?]:
Lemma 2 ([?])
Given the diagonal correction matrix ,
where denotes the -th power of .
Given Lemma 2, Maehara et al. [?] propose the linearization method, which pre-computes and then uses it to answer SimRank queries based on Equation (8). In particular, for any two nodes and , Equation (8) leads to
where denotes a -element column vector where the -th element equals and all other elements equal . To avoid the infinite series in Equation (9), the linearization method approximates with
which can be computed in time. It can be shown that if is precise and , then
Therefore, given an exact , the linearization method answers any single-pair SimRank query in time. With a slight modification of Equation 10, the method can also process any single-source SimRank query in time.
Unfortunately, the linearization method do not precisely derive , due to which the above time complexities does not hold in general. Specifically, Maehara et al. [?] formulate as the solution to a linear system, and propose to solve an approximate version of the system to derive an estimation of . However, there is no formal analysis on the errors in and their effects on the accuracy of SimRank computation. In addition, the technique used to solve the approximate linear system does not guarantee to converge, i.e., it may not return in bounded time. Furthermore, even if the technique does converge, its time complexity relies on a parameter that is unknown in advance, and may even dominate , , and . This makes it rather difficult to analyze the pre-computation time of the linearization method. We refer interested readers to Appendix SLING: A Near-Optimal Index Structure for SimRank for detailed discussions on these issues.
In summary, the linearization method by Maehara et al. [?] does not guarantee worst-case error in each SimRank score returned, and there is no non-trivial bound on its preprocessing time. This problem is partially addressed in recent work [?] by Yu and McCann, who propose a variant of the linearization method that does not pre-compute the diagonal correction matrix , but implicitly derives during query processing. Yu and McCann’s technique is able to ensure worst-case error in SimRank computation, but as a trade-off, it requires time to answer a single-pair SimRank query, which renders it inapplicable on any sizable graph.
This section presents our SLING index for SimRank queries. SLING is based on a new interpretation of SimRank scores, which we clarify in Section SLING: A Near-Optimal Index Structure for SimRank. After that, Sections SLING: A Near-Optimal Index Structure for SimRank-7 provide details of SLING and analyze its theoretical guarantees.
Let be the decay factor in the definition of SimRank (see Equation (SLING: A Near-Optimal Index Structure for SimRank)). Suppose that we perform a reverse random walk from any node in , such that
At each step of the walk, we stop with probability;
With the other probability, we inspect the in-neighbors of the node at the current step, and select one of them uniformly at random as the next step.
We refer to such a reverse random walk as a -walk from . In addition, we say that two -walks meet, if for a certain , the -th steps of the two walks are identical. (Note the -th step of a -walk is its starting node.) The following lemma shows an interesting connection between -walks and SimRank.
Let and be two -walks from two nodes and , respectively. Then, equals the probability that and meet.
The above formulation of SimRank is similar in spirit to the one used in the Monte Carlo method [?] (see Section SLING: A Near-Optimal Index Structure for SimRank), but differs in one crucial aspect: each -walk in our formulation has an expected length of , whereas each reverse random walk in the previous formulation is infinite. As a consequence, if we are to estimate using a sample set of -walks from and , we do not need to truncate any -walk for efficiency; in contrast, the Monte Carlo method [?] must trim each reverse random walk to trade estimation accuracy for bounded computation time. In fact, if we incorporate -walks into the Monte Carlo method, then its query time complexities are immediately improved by a factor of . Nonetheless, the space and time overheads of this revised method still leave much room for improvement, since it requires -walks for each node, where is the upper bound on the method’s failure probability. This motivates us to develop the SLING method for more efficient SimRank computation, which we elaborate in the following sections.
Let denote the probability that a -walk from arrives at in its -th step. We refer to as the hitting probability (HP) from to at step . Observe that, for any two -walks and from two nodes and , respectively, the probability that they meet at at the -th step is
Since equals the probability that and meet, one may attempt to compute by taking the the probability that and meet over all combinations of meeting nodes and meeting steps, i.e.,
However, this formulation is incorrect, because the events that “ and meet at node at step ” and “ and meet at node at step ” are not mutually exclusive. For example, assume that , and has only in-neighbor . In that case, and have probability to meet at at the -th step, and a non-zero probability to meet at at the first step. This leads to , whereas by definition.
Interestingly, Equation (12) can be fixed if we substitute with the probability of the event that “ and meet at at step , but never meet again afterwards”. To explain this, observe that the above event indicates that and last meet at at step . If we change (resp. ) in the event, then and should last meet at a different node (resp. step), in which case the changed event and the original one are mutually exclusive. Based on this observation, the following lemma presents a remedy to Equaiton (12).
Let be the probability that two -walks from node do not meet each other after the -th step. Then, for any two nodes and ,
In what follows, we refer to as the correction factor for .
Based on Lemma 4, we propose to pre-compute approximate versions of and HPs , and then use them to estimate SimRank scores based on Equation (4). The immediate problem here is that there exists an infinite number of HPs to approximate, since we need to consider all . However, we observe that if we allow an additive error in the approximate values, then most of the HPs can be estimated as zero and be omitted. In particular, we have the following observation:
For any node and , there exist at most nodes such that .
To understand this, recall that each -walk has only probability to not stop before the -th step, i.e.,
Therefore, at most of the HPs at step can be larger than . Even if we take into account all , the total number of HPs above is only
In other words, we only need to retain a constant number of HPs for each node, if we permit a constant additive error in each HP.
Based on the above analysis, we propose the SLING index, which pre-computes an approximate version of each correction factor , as well as a constant-size set of approximate HPs for each node . To derive the SimRank score of two nodes and , SLING first retrieves , , and , and then estimates in constant time based on an approximate version of Equation (13). The challenge in the design of SLING is threefold. First, how can we derive an accurate estimation of ? Second, how can we efficiently construct without iterating over all HPs? Third, how do we ensure that all and can jointly guarantee worst-case error in each SimRank score computed? In Sections SLING: A Near-Optimal Index Structure for SimRank-7, we elaborate how we address these challenges.
In other words, (resp. ) can be regarded as a random-walk-based interpretation of the entries in (resp. diagonal elements in ). Therefore, Lemmas 2 and 4 are different interpretations of the same result. The main advantage of our new interpretation is that it gives a physical meaning to which, as we show in Section SLING: A Near-Optimal Index Structure for SimRank, enables us to devise a simple and rigorous algorithm to estimate to any desired precision. In contrast, the only existing method for approximating [?] fails to provide any non-trivial guarantees in terms of accuracy and efficiency, as we discuss in Section SLING: A Near-Optimal Index Structure for SimRank.
Let and be two -walks from . By definition, is the probability that any of the following events occurs:
and meet at the first step.
In the first step, and arrive at two different nodes and , respectively; but sometime after the first step, and meet.
Note that the above two events are mutually exclusive, and the first event occurs with probability. For the second event, if we fix a pair of and , then the probability that and meet after the first step equals the probability that a -walk from meets a -walk from ; by Lemma 3, this probability is exactly . Therefore, we have
Equation (14) indicates that, if we are to estimate , it suffices to derive an estimation of
by sampling -walks from and . In particular, as long as is estimated with an error no more than , the resulting estimation of would have at most error. Motivated by this, we propose a sampling method for approximating , as shown in Algorithm 1.
In a nutshell, Algorithm 1 generates pairs of -walks, such that each walk starts from a randomly selected node in ; after that, the algorithm counts the number of pairs that meet at or after the first step; finally, it returns as an estimation of . By the Chernoff bound (see Appendix SLING: A Near-Optimal Index Structure for SimRank) and the properties of -walks, we have the following lemma on the theoretical guarantees of Algorithm 1.
Algorithm 1 runs in expected time, and returns such that holds with at least probability.
As mentioned in Section SLING: A Near-Optimal Index Structure for SimRank, we aim to construct a constant-size set for each node , such that contains an approximate version of each HP that is sufficiently large. Towards this end, a relatively straightforward solution is to sample a set of -walks from each , and then use to derive approximate HPs. This solution, however, requires walks in to ensure that the additive error in each is at most , which leads to considerable computation costs when is small.
Instead of sampling -walks, we devise a deterministic method for constructing all in time while allowing at most additive error in each approximate HP. The key idea of our method is to utilize the following equation on HPs:
for any . Intuitively, Equation (16) indicates that once we have derived the HPs to at step , then we can compute the HPs to at step . Based on this intuition, our method generates approximate HPs to by processing the steps in ascending order of . We note that our method is similar in spirit to the local update algorithm [?, ?, ?] for estimating personalized PageRanks [?], and we refer interested readers to Appendix SLING: A Near-Optimal Index Structure for SimRank for a discussion on the connections between our method and those in [?, ?, ?].
Algorithm 2 shows the pseudo-code of our method. Given and a threshold , the algorithm first initializes for each node (Line 1). After that, for each node , the algorithm performs a graph traversal from to generates approximate HPs from other nodes to . Specifically, for each , it first initializes a set , and then inserts an HP into , which captures the fact that every -walk from has probability to hit itself at the -th step (Lines 3-4). Then, the algorithm enters an iterative process, such that the -th iteration () processes the HPs to at step that have been inserted into .
In particular, in the -the iteration, the algorithm first identifies the approximate HPs in that are at step , and processes each of them in turn (Lines 6-16). If , then it is removed from , i.e., the algorithm omits an approximate HP if it is sufficiently small. Meanwhile, if , then the algorithm inspects each out-neighbor of , and updates the approximate HP from to at step , according to Equation (16). After all approximate HPs at step are processed, the algorithm terminates the iterative process on . Finally, the algorithm inserts each into , after which it proceeds to the next node .
The following lemma states the guarantees of Algorithm 2.
Algorithm 2 runs in time, and constructs a set of approximate HPs for each node , such that . In addition, for each , we have
Given an approximate correction factor and a set of approximate HPs for each node , we estimate the SimRank score between any two nodes and according to a revised version of Equation (13):
Algorithm 3 shows the details of our query processing method.
To analyze the accuracy guarantee of Algorithm 3, we first present a lemma that quantifies the error in based on the errors in and .
Suppose that for any , and
for any . Then, we have if
By Theorem 1, we can ensure worst-case error in each SimRank score by setting , , and . In that case, our SLING index requires pre-computation time and space, and it answers any single-pair SimRank query in time. The space (resp. query time) complexity of SLING is only times larger than the optimal value, since any SimRank method (that ensures worst-case error) requires space for storing the information about all nodes, and takes at least time to output the result of a single-pair SimRank query.
This section presents optimization techniques to (i) improve the efficiency of estimating each correction factors (Section SLING: A Near-Optimal Index Structure for SimRank), (ii) reduce the space consumption of SLING (Section SLING: A Near-Optimal Index Structure for SimRank), (iii) enhance the accuracy of SLING (Section SLING: A Near-Optimal Index Structure for SimRank), and (iv) incorporate parallel and out-of-core computation into SLING’s index construction algorithm (Section SLING: A Near-Optimal Index Structure for SimRank).
As discussed in Section SLING: A Near-Optimal Index Structure for SimRank, Algorithm 1 generates an approximate correction factor in expected time, where is the maximum error allowed in , and is the failure probability. As the algorithm’s time complexity is quadratic to , it is not particularly efficient when is small. This relative inefficiency is caused by the fact the algorithm requires pairs of -walks to estimate the value (in Equation 15) with worst-case error.
However, we observe that we can often use a much smaller number of -walk pairs to derive an estimation of with at most error. Specifically, by the Chernoff bound (see Appendix SLING: A Near-Optimal Index Structure for SimRank), we only need pairs of -walks to estimate . Apparently, this number is much smaller than when (which is often the case in practice). For example, if , then the number of -walk pairs required is only . The main issue here is that we do not know in advance. Nevertheless, if we can derive an upper bound of , and we use it to decide an appropriate number of -walks needed.