SLING: A NearOptimal Index Structure for SimRank
Abstract
SimRank is a similarity measure for graph nodes that has numerous applications in practice. Scalable SimRank computation has been the subject of extensive research for more than a decade, and yet, none of the existing solutions can efficiently derive SimRank scores on large graphs with provable accuracy guarantees. In particular, the stateoftheart solution requires up to a few seconds to compute a SimRank score in millionnode graphs, and does not offer any worstcase assurance in terms of the query error.
This paper presents SLING, an efficient index structure for SimRank computation. SLING guarantees that each SimRank score returned has at most additive error, and it answers any singlepair and singlesource SimRank queries in and time, respectively. These time complexities are nearoptimal, and are significantly better than the asymptotic bounds of the most recent approach. Furthermore, SLING requires only space (which is also nearoptimal in an asymptotic sense) and precomputation time, where is the failure probability of the preprocessing algorithm. We experimentally evaluate SLING with a variety of realworld graphs with up to several millions of nodes. Our results demonstrate that SLING is up to times (resp. times) faster than competing methods for singlepair (resp. singlesource) SimRank queries, at the cost of higher space overheads.
SLING: A NearOptimal Index Structure for SimRank
Boyu Tian 
Shanghai Jiao Tong University 
China 
bytian@umich.edu 
Xiaokui Xiao 
Nanyang Technological University 
Singapore 
xkxiao@ntu.edu.sg 
\@float
copyrightbox[b]
\end@floatAssessing the similarity of nodes based on graph topology is an important problem with numerous applications, including social network analysis [?], web mining [?], collaborative filtering [?], natural language processing [?], and spam detection [?]. A number of similarity measures have been proposed, among which SimRank [?] is one of the most welladopted. The formulation of SimRank is based on two intuitive arguments:

A node should have the maximum similarity to itself;

The similarity between two different nodes can be measured by the average similarity between the two nodes’ neighbors.
Formally, the SimRank score of two nodes and is defined as:
(1) 
where denotes the set of inneighbors of a node , and is a decay factor typically set to or [?, ?]. Previous work [?, ?, ?, ?, ?, ?, ?, ?, ?] has applied SimRank (and its variants) to various problem domains, and has demonstrated that it often provides highquality measurements of node similarity.
Despite of the effectiveness of SimRank, computing SimRank scores efficiently on large graphs is a challenging task, and has been the subject of extensive research for more than a decade. In particular, Jeh and Widom [?] propose the first SimRank algorithm, which returns the SimRank scores of all pairs of nodes in the input graph . The algorithm incurs prohibitive costs: it requires space and time, where and denote the numbers of nodes and edges in , respectively, and is the maximum additive error allowed in any SimRank score. Subsequently, Lizorkin et al. [?] improve the time complexity of the algorithm to , which is further improved to by Yu et al. [?], where . However, the space complexity of the algorithm remains , as is inherent in any algorithm that computes allpair SimRank scores.
Fogaras and Rácz [?] present the first study on singlepair SimRank computation, and propose a MonteCarlo method that requires precomputation time and space. The method returns the SimRank score of any node pair in time, where is the failure probability of the MonteCarlo method. Subsequently, Li et al. [?] propose a deterministic algorithm for singlepair SimRank queries; it has the same time complexity with Jeh and Widom’s solution [?], but provides much better practical efficiency. However, existing work [?] show that neither Li et al.’s [?] nor Fogaras and Rácz’s solution [?] is able to handle millionnode graphs in reasonable time and space. There is a line of research [?, ?, ?, ?, ?, ?] that attempts to mitigate this efficiency issue based on an alternative formulation of SimRank, but the formulation is shown to be incorrect [?], in that it does not return the same SimRank scores as defined in Equation (SLING: A NearOptimal Index Structure for SimRank).
Algorithm  Query Time  Space Overhead  Preprocessing Time  
Single Pair  Single Source  
Fogaras and Rácz [?]  

no formal result  
this paper  (Algorithm 3)  
(Algorithm 6)  
lower bound   
The most recent approach to SimRank computation is the linearization technique [?] by Maehara et al., which is shown to considerably outperform existing solutions in terms of efficiency and scalability. Nevertheless, it still requires up to a few seconds to answer a singlepair SimRank query on sizable graphs, which is inadequate for largescale applications. More importantly, the technique is unable to provide any worstcase guarantee in terms of query accuracy. In particular, the technique has a preprocessing step that requires solving a system of linear equations; assuming that the solution to is exact, Maehara et al. [?] show that the technique can ensure worstcase query error, and can answer any singlepair and singlesource SimRank queries in and time, respectively. (A singlesource SimRank query from a node asks for the SimRank score between and every other node.) Unfortunately, as we discuss in Section SLING: A NearOptimal Index Structure for SimRank, the linearization technique cannot precisely solve , nor can it offer nontrivial guarantees in terms of the query errors incurred by the imprecision of ’s solution. Consequently, the technique in [?] only provides heuristic solutions to SimRank computation. In summary, after more than tens years of research on SimRank, there is still no solution for efficient SimRank computation on large graphs with provable accuracy guarantees.
This paper presents SLING (SimRank via Local Updates and Sampling), an efficient index structure for SimRank computation. SLING guarantees that each SimRank score returned has at most additive error, and answers any singlepair and singlesource SimRank queries in and time, respectively. These time complexities are nearoptimal, since any SimRank method requires (resp. ) time to output the result of any singlepair (resp. singlesource) query. In addition, they are significantly better than the asymptotic bounds of the states of the art (including Maehara et al.’s technique [?] under their heuristic assumptions), as we show in Table SLING: A NearOptimal Index Structure for SimRank. Furthermore, SLING requires only space (which is also nearoptimal in an asymptotic sense) and precomputation time, where is the failure probability of the preprocessing algorithm.
Apart from its superior asymptotic bounds, SLING also incorporates several optimization techniques to enhance its practical performance. In particular, we show that its preprocessing algorithm can be improved with a technique that estimates the expectation of a Bernoulli variable using an asymptotically optimal number of samples. Additionally, its space consumption can be heuristically reduced without affecting its theoretical guarantees, while its empirical efficiency for singlesource SimRank queries can be considerably improved, at the cost of a slight increase in its query time complexity. Last but not least, its construction algorithms can be easily parallelized, and it can efficiently process queries even when its index structure does not fit in the main memory.
We experimentally evaluate SLING with a variety of realworld graphs with up to several millions of nodes, and show that it significantly outperforms the the states of the art in terms of query efficiency. Specifically, SLING requires at most milliseconds to process a singlepair SimRank query on our datasets, and is up to times faster than the linearization method [?]. To our knowledge, this is the first result in the literature that demonstrates millisecondscale query time for singlepair SimRank computation on millionnode graphs. For singlesource SimRank queries, SLING is up to times more efficient than the linearization method. As a tradeoff, SLING incurs larger space overheads than the linearization method, but it is a still much more favorable choice in the common scenario where query time and accuracy (instead of space consumption) are the main concern.
The remainder of the paper is organized as follows. Section SLING: A NearOptimal Index Structure for SimRank defines the problem that we study. Section \thetable discusses the major existing methods for SimRank computation. Section SLING: A NearOptimal Index Structure for SimRank presents the SLING index, with a focus on singlepair queries. Section SLING: A NearOptimal Index Structure for SimRank proposes techniques to optimize the practical performance of SLING. Section SLING: A NearOptimal Index Structure for SimRank details how SLING supports singlesource queries. Section \thefigure experimentally evaluates SLING against the stats of the art
Let be a directed and unweighted graph with nodes and edges. We aim to construct an index structure on to support singlepair and singlesource SimRank queries, which are defined as follows:

A singlepair SimRank query takes as input two nodes and in , and returns their SimRank score (see Equation SLING: A NearOptimal Index Structure for SimRank).

A singlesource SimRank query takes as input a node , and returns for each node in .
Following previous work [?, ?, ?, ?], we allow an additive error of at most in each SimRank score returned for any SimRank query.
For ease of exposition, we focus on singlepair SimRank queries in Sections \thetableSLING: A NearOptimal Index Structure for SimRank, and then discuss singlesource queries in Section SLING: A NearOptimal Index Structure for SimRank. Table SLING: A NearOptimal Index Structure for SimRank shows the notations frequently used in the paper. Unless otherwise specified, all logarithms in this paper are to base .
Notation  Description 
the input graph  
the numbers of nodes and edges in  
the th node in  
the set of inneighbors of a node in  
the SimRank score of two nodes and in  
the decay factor in the definition of SimRank  
the maximum additive error allowed in a SimRank score  
the failure probability of a MonteCarlo algorithm  
the entry on the th row and th column of a matrix  
the correction factor for node  
the hitting probability (HP) from node to node at step (see Section SLING: A NearOptimal Index Structure for SimRank) 
This section revisits the three major approaches to SimRank computation: the power method [?], the Monte Carlo method [?], and the linearization method [?, ?, ?, ?]. The asymptotic performance of the Monte Carlo method and the linearization method has been studied in literature, but to our knowledge, there is no formal analysis regarding their space and time complexities when ensuring worstcase errors. We remedy this issue with detailed discussions on each method’s asymptotic bounds and limitations.
The power method [?] is an iterative method for computing the SimRank scores of all pairs of nodes in an input graph. The method uses a matrix , where the element on the th row and th column () denotes the SimRank score of the th node and th node . Initially, the method sets
After that, in the th () iteration, the method updates based on the following equation:
Let denote the version of right after the th iteration. Lizorkin et al. [?] establish the following connection between and the errors in the SimRank scores in :
Lemma 1 ([?])
If , then for any , we have .
Based on Lemma 1 and the fact that each iteration of the power method takes time, we conclude that the power method runs in time when ensuring worstcase error. In addition, it requires space (caused by ). These large complexities in time and space make the power method only applicable on small graphs.
The Monte Carlo method [?] is motivated by an alternative definition of SimRank scores [?] that utilizes the concept of reverse random walks. Given a node in , a reverse random walk from is a sequence of nodes , such that () is selected uniformly at random from the inneighbors of . We refer to as the th step of .
Suppose that we have two reverse random walks and that start from two nodes and , respectively, and they first meet at the th step. That is, the th steps of and are identical, but for any , the th step of differs from the th step of . Jeh and Widom [?] establishes the following connection between and the SimRank score of and :
(2) 
where denotes the expectation of a random variable.
Based on Equation (2), the Monte Carlo method [?] precomputes a set of reverse random walks from each node in , such that (i) each set has the same number of walks, and (ii) each walk in is truncated at step , i.e., the nodes after the th step are omitted. (This truncation is necessary to ensure that the walk is computed efficiently.) Then, given two nodes and , the method estimates their SimRank score as
where denotes the step at which the th walk in first meets with the th walk in . Fogaras and Rácz [?] show that, with at least probability,
(3) 
However, we note that , due to the truncation imposed on the reverse random walks in and . To address this issue, we present the following inequality:
(4) 
By Equations (3) and (SLING: A NearOptimal Index Structure for SimRank) and the union bound, it can be verified that when and ,
holds for all pairs of and with at least probability. In that case, the space and preprocessing time complexities of the Monte Carlo method are both . In addition, the method takes time to answer a singlepair SimRank query, and time to process a singlesource SimRank query. These space and time complexities are rather unfavorable under typical settings of in practice (e.g., ). Fogaras and Rácz [?] alleviate this issue with a coupling technique, which improves the practical performance of the Monte Carlo method in terms of precomputation time and space consumption. Nevertheless, the method still incurs significant overheads, due to which it is unable to handle graphs with over one million nodes, as we show in Section \thefigure.
Let and be two matrices, with and
(5) 
Yu et al. [?] show that Equation (SLING: A NearOptimal Index Structure for SimRank) (i.e., the definition of SimRank) can be rewritten as
(6) 
where is an identity matrix, is the transpose of , and is the elementwise maximum operator, i.e., for any two matrices and and any .
Maehara et al. [?] point out that solving Equation (6) is difficult since it is a nonlinear problem due to the operator. To circumvent this difficulty, they prove that there exists a diagonal matrix (referred to as the diagonal correction matrix), such that
(7) 
Furthermore, once is given, one can uniquely derive based on the following lemma by Maehara et al. [?]:
Lemma 2 ([?])
Given the diagonal correction matrix ,
(8) 
where denotes the th power of .
Given Lemma 2, Maehara et al. [?] propose the linearization method, which precomputes and then uses it to answer SimRank queries based on Equation (8). In particular, for any two nodes and , Equation (8) leads to
(9) 
where denotes a element column vector where the th element equals and all other elements equal . To avoid the infinite series in Equation (9), the linearization method approximates with
(10) 
which can be computed in time. It can be shown that if is precise and , then
(11) 
Therefore, given an exact , the linearization method answers any singlepair SimRank query in time. With a slight modification of Equation 10, the method can also process any singlesource SimRank query in time.
Unfortunately, the linearization method do not precisely derive , due to which the above time complexities does not hold in general. Specifically, Maehara et al. [?] formulate as the solution to a linear system, and propose to solve an approximate version of the system to derive an estimation of . However, there is no formal analysis on the errors in and their effects on the accuracy of SimRank computation. In addition, the technique used to solve the approximate linear system does not guarantee to converge, i.e., it may not return in bounded time. Furthermore, even if the technique does converge, its time complexity relies on a parameter that is unknown in advance, and may even dominate , , and . This makes it rather difficult to analyze the precomputation time of the linearization method. We refer interested readers to Appendix SLING: A NearOptimal Index Structure for SimRank for detailed discussions on these issues.
In summary, the linearization method by Maehara et al. [?] does not guarantee worstcase error in each SimRank score returned, and there is no nontrivial bound on its preprocessing time. This problem is partially addressed in recent work [?] by Yu and McCann, who propose a variant of the linearization method that does not precompute the diagonal correction matrix , but implicitly derives during query processing. Yu and McCann’s technique is able to ensure worstcase error in SimRank computation, but as a tradeoff, it requires time to answer a singlepair SimRank query, which renders it inapplicable on any sizable graph.
This section presents our SLING index for SimRank queries. SLING is based on a new interpretation of SimRank scores, which we clarify in Section SLING: A NearOptimal Index Structure for SimRank. After that, Sections SLING: A NearOptimal Index Structure for SimRank7 provide details of SLING and analyze its theoretical guarantees.
Let be the decay factor in the definition of SimRank (see Equation (SLING: A NearOptimal Index Structure for SimRank)). Suppose that we perform a reverse random walk from any node in , such that

At each step of the walk, we stop with probability;

With the other probability, we inspect the inneighbors of the node at the current step, and select one of them uniformly at random as the next step.
We refer to such a reverse random walk as a walk from . In addition, we say that two walks meet, if for a certain , the th steps of the two walks are identical. (Note the th step of a walk is its starting node.) The following lemma shows an interesting connection between walks and SimRank.
Lemma 3
Let and be two walks from two nodes and , respectively. Then, equals the probability that and meet.
The above formulation of SimRank is similar in spirit to the one used in the Monte Carlo method [?] (see Section SLING: A NearOptimal Index Structure for SimRank), but differs in one crucial aspect: each walk in our formulation has an expected length of , whereas each reverse random walk in the previous formulation is infinite. As a consequence, if we are to estimate using a sample set of walks from and , we do not need to truncate any walk for efficiency; in contrast, the Monte Carlo method [?] must trim each reverse random walk to trade estimation accuracy for bounded computation time. In fact, if we incorporate walks into the Monte Carlo method, then its query time complexities are immediately improved by a factor of . Nonetheless, the space and time overheads of this revised method still leave much room for improvement, since it requires walks for each node, where is the upper bound on the method’s failure probability. This motivates us to develop the SLING method for more efficient SimRank computation, which we elaborate in the following sections.
Let denote the probability that a walk from arrives at in its th step. We refer to as the hitting probability (HP) from to at step . Observe that, for any two walks and from two nodes and , respectively, the probability that they meet at at the th step is
Since equals the probability that and meet, one may attempt to compute by taking the the probability that and meet over all combinations of meeting nodes and meeting steps, i.e.,
(12) 
However, this formulation is incorrect, because the events that “ and meet at node at step ” and “ and meet at node at step ” are not mutually exclusive. For example, assume that , and has only inneighbor . In that case, and have probability to meet at at the th step, and a nonzero probability to meet at at the first step. This leads to , whereas by definition.
Interestingly, Equation (12) can be fixed if we substitute with the probability of the event that “ and meet at at step , but never meet again afterwards”. To explain this, observe that the above event indicates that and last meet at at step . If we change (resp. ) in the event, then and should last meet at a different node (resp. step), in which case the changed event and the original one are mutually exclusive. Based on this observation, the following lemma presents a remedy to Equaiton (12).
Lemma 4
Let be the probability that two walks from node do not meet each other after the th step. Then, for any two nodes and ,
(13) 
In what follows, we refer to as the correction factor for .
Based on Lemma 4, we propose to precompute approximate versions of and HPs , and then use them to estimate SimRank scores based on Equation (4). The immediate problem here is that there exists an infinite number of HPs to approximate, since we need to consider all . However, we observe that if we allow an additive error in the approximate values, then most of the HPs can be estimated as zero and be omitted. In particular, we have the following observation:
Observation 1
For any node and , there exist at most nodes such that .
To understand this, recall that each walk has only probability to not stop before the th step, i.e.,
Therefore, at most of the HPs at step can be larger than . Even if we take into account all , the total number of HPs above is only
In other words, we only need to retain a constant number of HPs for each node, if we permit a constant additive error in each HP.
Based on the above analysis, we propose the SLING index, which precomputes an approximate version of each correction factor , as well as a constantsize set of approximate HPs for each node . To derive the SimRank score of two nodes and , SLING first retrieves , , and , and then estimates in constant time based on an approximate version of Equation (13). The challenge in the design of SLING is threefold. First, how can we derive an accurate estimation of ? Second, how can we efficiently construct without iterating over all HPs? Third, how do we ensure that all and can jointly guarantee worstcase error in each SimRank score computed? In Sections SLING: A NearOptimal Index Structure for SimRank7, we elaborate how we address these challenges.
Lemma 5
In other words, (resp. ) can be regarded as a randomwalkbased interpretation of the entries in (resp. diagonal elements in ). Therefore, Lemmas 2 and 4 are different interpretations of the same result. The main advantage of our new interpretation is that it gives a physical meaning to which, as we show in Section SLING: A NearOptimal Index Structure for SimRank, enables us to devise a simple and rigorous algorithm to estimate to any desired precision. In contrast, the only existing method for approximating [?] fails to provide any nontrivial guarantees in terms of accuracy and efficiency, as we discuss in Section SLING: A NearOptimal Index Structure for SimRank.
Let and be two walks from . By definition, is the probability that any of the following events occurs:

and meet at the first step.

In the first step, and arrive at two different nodes and , respectively; but sometime after the first step, and meet.
Note that the above two events are mutually exclusive, and the first event occurs with probability. For the second event, if we fix a pair of and , then the probability that and meet after the first step equals the probability that a walk from meets a walk from ; by Lemma 3, this probability is exactly . Therefore, we have
(14) 
Equation (14) indicates that, if we are to estimate , it suffices to derive an estimation of
(15) 
by sampling walks from and . In particular, as long as is estimated with an error no more than , the resulting estimation of would have at most error. Motivated by this, we propose a sampling method for approximating , as shown in Algorithm 1.
In a nutshell, Algorithm 1 generates pairs of walks, such that each walk starts from a randomly selected node in ; after that, the algorithm counts the number of pairs that meet at or after the first step; finally, it returns as an estimation of . By the Chernoff bound (see Appendix SLING: A NearOptimal Index Structure for SimRank) and the properties of walks, we have the following lemma on the theoretical guarantees of Algorithm 1.
Lemma 6
Algorithm 1 runs in expected time, and returns such that holds with at least probability.
As mentioned in Section SLING: A NearOptimal Index Structure for SimRank, we aim to construct a constantsize set for each node , such that contains an approximate version of each HP that is sufficiently large. Towards this end, a relatively straightforward solution is to sample a set of walks from each , and then use to derive approximate HPs. This solution, however, requires walks in to ensure that the additive error in each is at most , which leads to considerable computation costs when is small.
Instead of sampling walks, we devise a deterministic method for constructing all in time while allowing at most additive error in each approximate HP. The key idea of our method is to utilize the following equation on HPs:
(16) 
for any . Intuitively, Equation (16) indicates that once we have derived the HPs to at step , then we can compute the HPs to at step . Based on this intuition, our method generates approximate HPs to by processing the steps in ascending order of . We note that our method is similar in spirit to the local update algorithm [?, ?, ?] for estimating personalized PageRanks [?], and we refer interested readers to Appendix SLING: A NearOptimal Index Structure for SimRank for a discussion on the connections between our method and those in [?, ?, ?].
Algorithm 2 shows the pseudocode of our method. Given and a threshold , the algorithm first initializes for each node (Line 1). After that, for each node , the algorithm performs a graph traversal from to generates approximate HPs from other nodes to . Specifically, for each , it first initializes a set , and then inserts an HP into , which captures the fact that every walk from has probability to hit itself at the th step (Lines 34). Then, the algorithm enters an iterative process, such that the th iteration () processes the HPs to at step that have been inserted into .
In particular, in the the iteration, the algorithm first identifies the approximate HPs in that are at step , and processes each of them in turn (Lines 616). If , then it is removed from , i.e., the algorithm omits an approximate HP if it is sufficiently small. Meanwhile, if , then the algorithm inspects each outneighbor of , and updates the approximate HP from to at step , according to Equation (16). After all approximate HPs at step are processed, the algorithm terminates the iterative process on . Finally, the algorithm inserts each into , after which it proceeds to the next node .
The following lemma states the guarantees of Algorithm 2.
Lemma 7
Algorithm 2 runs in time, and constructs a set of approximate HPs for each node , such that . In addition, for each , we have
Given an approximate correction factor and a set of approximate HPs for each node , we estimate the SimRank score between any two nodes and according to a revised version of Equation (13):
(17) 
Algorithm 3 shows the details of our query processing method.
To analyze the accuracy guarantee of Algorithm 3, we first present a lemma that quantifies the error in based on the errors in and .
Lemma 8
Suppose that for any , and
for any . Then, we have if
Theorem 1
By Theorem 1, we can ensure worstcase error in each SimRank score by setting , , and . In that case, our SLING index requires precomputation time and space, and it answers any singlepair SimRank query in time. The space (resp. query time) complexity of SLING is only times larger than the optimal value, since any SimRank method (that ensures worstcase error) requires space for storing the information about all nodes, and takes at least time to output the result of a singlepair SimRank query.
This section presents optimization techniques to (i) improve the efficiency of estimating each correction factors (Section SLING: A NearOptimal Index Structure for SimRank), (ii) reduce the space consumption of SLING (Section SLING: A NearOptimal Index Structure for SimRank), (iii) enhance the accuracy of SLING (Section SLING: A NearOptimal Index Structure for SimRank), and (iv) incorporate parallel and outofcore computation into SLING’s index construction algorithm (Section SLING: A NearOptimal Index Structure for SimRank).
As discussed in Section SLING: A NearOptimal Index Structure for SimRank, Algorithm 1 generates an approximate correction factor in expected time, where is the maximum error allowed in , and is the failure probability. As the algorithm’s time complexity is quadratic to , it is not particularly efficient when is small. This relative inefficiency is caused by the fact the algorithm requires pairs of walks to estimate the value (in Equation 15) with worstcase error.
However, we observe that we can often use a much smaller number of walk pairs to derive an estimation of with at most error. Specifically, by the Chernoff bound (see Appendix SLING: A NearOptimal Index Structure for SimRank), we only need pairs of walks to estimate . Apparently, this number is much smaller than when (which is often the case in practice). For example, if , then the number of walk pairs required is only . The main issue here is that we do not know in advance. Nevertheless, if we can derive an upper bound of , and we use it to decide an appropriate number of walks needed.