# Asynchronous Approximation of a Single Component of the Solution to a Linear System

## Abstract

We present a distributed asynchronous algorithm for approximating a single component of the solution to a system of linear equations Ax=b, where A is a positive definite real matrix and b∈Rn. This can equivalently be formulated as solving for xi in x=Gx+z for some G and z such that the spectral radius of G is less than 1. Our algorithm relies on the Neumann series characterization of the component xi, and is based on residual updates. We analyze our algorithm within the context of a cloud computation model, in which the computation is split into small update tasks performed by small processors with shared access to a distributed file system. We prove a robust asymptotic convergence result when ρ(~G)<1, regardless of the precise order and frequency in which the update tasks are performed, where ~Gij=|Gij|. We provide convergence rate bounds which depend on the order of update tasks performed, analyzing both deterministic update rules via counting weighted random walks, as well as probabilistic update rules via concentration bounds. The probabilistic analysis requires analyzing the product of random matrices which are drawn from distributions that are time and path dependent. We specifically consider the setting where n is large, yet G is sparse, e.g., each row has at most d nonzero entries. This is motivated by applications in which G is derived from the edge structure of an underlying graph. Our results prove that if the local neighborhood of the graph does not grow too quickly as a function of n, our algorithm can provide significant reduction in computation cost as opposed to any algorithm which computes the global solution vector x. Our algorithm obtains an ϵ∥x∥2 additive approximation for xi in constant time with respect to the size of the matrix when d=O(1) and 1/(1−∥G∥2)=O(1).

Christina E. Lee, Asuman Ozdaglar, Devavrat Shah Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139,

, ,

## 1Introduction

Imagine that you are a small restaurant owner in a city. You would like to obtain a quantitative estimate of how your popularity and reputation compare to your competitors within a 5 mile radius of you. You may want to compare the significance of the associated websites of your restaurant and other similar restaurants within the webgraph. This can be measured by PageRank, a quantity used by Google to rank search results. PageRank is defined as the solution to x=α1/n+(1−α)PTx, where P is the adjacency matrix of the webgraph, α is a given parameter, 1 is the vector of all ones, and n is the dimension. Alternatively, you may want to compare the social influence of the restaurants’ associated Facebook pages, which can be computed via the Bonacich centrality. Bonacich centrality is defined as the solution to x=(I−αG)−11, where G is the adjacency matrix of the social network, and α is a given parameter. Both PageRank and Bonacich centrality can be formulated as the solution to a system of linear equations, where the dimension is as large as the webpages in the webgraph or the number of Facebook pages, which is an overwhelming computational expense for our hypothetical small restaurant owner. In this paper, we investigate the question: can we obtain estimates of a few coordinates of the solution vector without the expense of approximating the entire solution vector?

We consider approximating the ith component of the solution to a linear system of equations Ax=b, where A is a positive definite n×n real matrix, and b is a vector in Rn. Positive definite matrices include symmetric diagonally dominant matrices, such as the Laplacian, and also our motivating examples of network centralities, PageRank and Bonacich centrality. Note that A or G may not be symmetric. When A is positive definite, there exists a choice of G and z such that the problem is equivalent to approximating the ith component of the solution to x=Gx+z, and the spectral radius of G, denoted ρ(G), is less than 1. For PageRank, ρ(G) is a constant, bounded by the teleportation probability, independent of the underlying graph. For Bonacich centrality, ρ(G) can be chosen to be less than 1 by a proper choice of the “discount factor” for any graph.

We consider the setting where n is large, yet G is sparse, i.e., the number of nonzero entries in every row of G is at most d. This form of sparsity comes from settings in which the matrix is derived from an underlying bounded degree graph. We will also discuss how we can relax this constraint to other sparse graphs in which the size of the local neighborhood does not grow too quickly.

Solving large systems of linear equations is a problem of great interest due to its relevance to a variety of applications across science and engineering, such as solving large scale optimization problems, approximating solutions to partial differential equations, and modeling network centralities. Due to the large scale of these systems, it becomes useful to have an algorithm which can approximate only a few components of the solution without computing over the entire matrix. Such an algorithm would also lead to efficient ranking and comparison methods. As solving a system of linear equations is fundamentally a problem which involves the full matrix, computing a single component of the solution is non-trivial.

In this era of big data, the classic computation model has changed significantly to accomodate for computation which is too large to compute within a single processor’s memory. We will consider a distributed cloud computation model, in which there are many processors with small constant size memory, yet they have access through the cloud to a distributed file system, e.g. GoogleFS or HDFS, which stores the information regarding the entire matrix. Our algorithm will consist of a sequence of small tasks which can be assigned to different processors to compute asynchronously. We will measure the cost of our algorithm via the amount of computational resources consumed, e.g. number of tasks, DFS accesses, and memory consumed.

### 1.1Contributions

In this paper, we propose and analyze an asynchronous distributed algorithm which provides an estimate ^xi for a single component of the solution vector to x=Gx+z. Our algorithm relies upon the Neumann series representation of the solution to x=Gx+z,

where ei denotes the standard basis vector which takes value 1 at coordinate i and 0 elsewhere. We can interpret the term zT(GT)kei to be the weighted sum of all walks of length k from node i on the underlying graph defined by G. Since we focus on approximating only xi, we can compute the k lower order terms of the summation by summing weighted walks within the k-radius neighborhood of node i, as opposed to the entire graph. This introduces a locality in computation that we can exploit if the neighborhoods of node i do not grow too quickly.

Throughout the paper, we will associate a graph to the matrix G, and we will provide our analysis as a function of various properties of the graph. Let G(G)=(V,E) denote the directed graph where V={1,2,…n}, and (u,v)∈E if and only if Guv≠0. Each coordinate of vector x corresponds to a node in V. Let Nu(t)⊂V denote the nodes within a neighborhood of radius t around node u, i.e. v∈Nu(t) if there exists a path from u to v of length at most t. We will denote the immediate neighbors of node u by Nu, i.e., v∈Nu if Guv≠0. The sparsity assumption on G means that |Nu|≤d for all u.

Our algorithm is an iterative residual based method in which every task corresponds to updating one coordinate of the residual vector. Expression can be rearranged as

where r(t)=(GT)kei are residual vectors which can be computed iteratively, and the estimate is computed as ^x(t)i=zT∑t−1k=0r(k). The synchronous implementation of the algorithm updates the estimate by adding the value of the residual vector in each iteration and updating the residual by multiplying by GT.

The asynchronous implementation of the algorithm updates one coordinate of the residual vector at a time. For example updating coordinate u corresponds to adding ru to ^xi, and multiplying ru by the uth row of G and adding that to the residual vector r. These updates can be interpreted as accumulating weights of walks over the graph, beginning with short length walks. We will prove that every update task maintains an invariant

where ^xi denotes the estimate, and r denotes the residual vector. The invariant property characterizes the error at every iteration, and allows us to prove that the algorithm always converges when ρ(~G)<1, where ~Gij=|Gij|. The convergence result holds regardless of the order in which coordinates are updated, as long as each coordinate is updated infinitely often, and it is robust to asynchronous updates in which the computation corresponding to different tasks may interweave in the order that they update the residual vector in the DFS. The conditions are given in terms of matrix ~G rather than G, since the asynchronous implementation may sum walks of different lengths simultaneously. We use ~Gl to obtain a worst case bound on the total contribution of any set of walks of length longer than some l. We do not require uniform bounds on communication delays or clock rates, as often needed for similar results in asynchronous computation (see [?]).

Since the error is directly related to the residual vector, we can easily design termination conditions that guarantee upper bounds on the estimation error. We propose to terminate when ∥r∥2<ϵ, which guarantees that |^xi−xi|≤ϵ∥x∥2. The sparsity pattern of the residual vector will grow according to an expanding local neighborhood around node i in the graph defined by G, allowing us to upper bound the number of update tasks needed by the computation as a function of the size of this neighborhood.

Our algorithm requires O(n+|E|) space in the distributed file system, and a single update task for requires O(|Nu|) DFS accesses, where u is the coordinate being updated. The convergence rate of our algorithm can be analyzed via the evolution of the residual vector r, which is a function of the particular order, or sequence, of tasks in which the coordinates are updated. We analyze a few different implementations of our algorithm, corresponding to variants on how to choose the order for updating coordinates of the residual vector. We provide our bounds as a function of the maximum degree of the graph, denoted by d, but we can extend the results to other graphs in which we have an upper bound for how the size of the local neighborhood grows.

As a baseline, we compute the cost of a synchronous distributed implementation in which the tasks coordinate between iterations to update the residual vector according to

which involves ∥r∥0 individual coordinate update tasks. We prove that the synchronous implementation attains error less than ϵ∥x∥2 with at most

update tasks. This calculation assumes that the tasks coordinate to keep track of the residual vector in each iteration, which requires extra coordination cost that could lead to delays.

We analyze the asynchronous implementation in which the update tasks do not coordinate different iterations of computation, but rather update the same residual vector r in the DFS. Rather than multiplying by matrix G in each iteration, every individual update task corresponds to applying a local update involving a single row of the matrix G. When the coordinates update sequentially in the order imposed by the expanding local neighborhoods of node i, the convergence rate is very similar to the synchronous implementation, requiring at most

update tasks until the error is less than ϵ∥x∥2. This update rule ensures that first all coordinates in Ni(1) are updated, followed by all coordinates in Ni(2), where the coordinates within the same neighborhood can be updated in any order. The order of updates can be coordinated by a designated master processor which manages a shared task queue for all other processors. This update order ensures that short walks get counted in the estimate earlier. The bound depends on ~G due to using a worst case upper bound for the weight of all walks of length longer than a certain value. Compared to the synchronous implementation, this analysis is weaker when G may have positive or negative entries, since ∥G∥2≤∥~G∥2.

We can alternatively employ randomness to sample the next coordinate to update, enabling every processor to generate the next update task without any coordination cost among other tasks. The algorithm adaptively samples the next coordinate to update according to a distribution which depends on the current residual vector. When the sequence of coordinate updates are sampled uniformly amongst coordinates with nonzero residual values, we can guarantee that with probability at least 1−δ, the error contracts by a time varying factor in each step, and the algorithm involves at most

update tasks until the error is less than ϵ∥x∥2. We term this ‘uniform censored sampling’, since we censored the coordinates according to whether the residual value is nonzero, and we sample uniformly otherwise. Establishing the convergence rate requires bounding the Lyapunov exponent for a product of random matrices drawn from time and path dependent distributions. This is inherently different from previous analyses of randomized coordinate updates, which sample from a history independent distribution. We developed a new analysis for ‘uniform censored sampling’ updates.

We can compare with the bounds for the synchronous implementation by considering that 1−∥G∥2≈−ln(∥G∥2) when ∥G∥2≈1. The randomized update implementation scales exponentially with d, whereas the other two bounds only scale polynomially with d. The gap is due to the fact that the synchronous and deterministic asynchronous implementations update in an order which ensures that short walks are counted earlier. Intuitively, we expect that the weight of the walks decays exponentially due to the weight being a product over values in G which converge eventually to zero. Therefore, by sampling uniformly amongst all coordinates with nonzero residuals, the algorithm may choose to update coordinates which are farther away from node i before it finishes updating coordinates within a closer neighborhood of i. As a result of a single update task, the contributions added in the process corresponding to updates of the residuals along neighbors will will be approximately “exponentially less significant”, and yet the coordinates still carry an equal weight in determining the next coordinate to update. This leads to the exponentially slower convergence as a function of d. This can be remedied by emphasizing coordinates with larger residuals, which we explore heuristically through simulations.

The right hand expressions within the convergence rate bounds across the different implementations are essentially the same, and provide a comparison of our algorithm to standard linear iterative methods, which also converge at the same rate. The left hand expressions provide a local analysis utilizing the sparsity of G. They show that the number of tasks required by our algorithm to reach a specified precision is constant with respect to n as long as d=O(1) and 1/(1−∥G∥2)≈−1/ln(∥G∥2)=O(1). The analysis shows that as long as the local neighborhood does not grow too quickly, i.e., the network is large and sparse enough, and the spectral properties are well behaved, i.e., ∥G∥2 is bounded away from 1, there is a n0 such that for all n≥n0, our algorithm obtains an estimate of xi with fewer computational tasks than any centralized algorithm, by the simple fact that the required tasks of our algorithm is upper bounded by an expression which is independent of n, and any centralized algorithm will scale at least as the size of the solution vector.

### 1.2Related Work

There are few existing methods which have explored single component approximations of the solution vector. Standard techniques such as Gaussian elimination, factorization or decomposition, gradient methods, and linear iterative methods all output the full solution vector, and thus the computation involves all coordinates and all entries in the matrix [?]. There are nearly linear time^{1}

The Ulam von Neumann algorithm is a Monte Carlo method which obtains an estimate for xi by sampling random walks. It interprets the Neumann series representation of the solution x as a sum over weighted walks on G(G), and obtains an estimate by sampling random walks starting from node i over G(G) and appropriately reweighting to obtain an unbiased estimator [?]. The challenge is to control the variance of this estimator, which could be large. The classic choice for the sampling matrix requires ∥G∥∞<1, though there are modifications which propose other sampling matrices or use correlated sampling to reduce the variance [?]. The scope of this algorithm is limited, as Ji, Mascagni, and Li proved that there is a class of matrices such that ρ(G)<1, ∥G∥∞>1, and there does not exist any sampling matrix such that the variance of the corresponding estimator is finite [?]. In contrast, our algorithm exploits the sparsity of G and provides a convergent solution when ρ(G)<1 and convergence rates when ∥G∥2<1. Single component approximation of the leading eigenvector for a stochastic matrix has been studied previously using Monte Carlo random walk sampling methods as well [?].

Our work is most related to stationary linear iterative methods, which use updates of the form xt+1=Gxt+z to recursively approximate leading terms of the Neumann series. The error after t iterations is given by Gt(x−x0), thus the number of iterations to achieve ∥xt−x∥2≤ϵ∥x∥2, is given by ln(ϵ)/ln(∥G∥2). For any t, xt will be at least as dense as z, and there is no reason to assume that z is sparse. Thus a single update step could cost nd multiplications. These methods do not exploit the sparsity of G and the locality of computing a single component.

Bertsekas and Tsitsiklis analyzed the asynchronous implementation of stationary linear iterative methods for solving for the full vector x, where they assign each of n processor to compute updates corresponding to a specific coordinate. They use a different computation model involving a network of distributed processors computing simultaneously, whereas our model involves a shared global memory and variable number of processors computing in parallel. The cost of our algorithm is considered in terms of computational resources consumed, i.e., the number of tasks and DFS accesses, whereas they consider the number of parallel computations until convergence, where each of the n processors are computing at every time step. Our algorithm relies on residual based updates, maintaining an invariant that allows us to precisely characterize the error as a function of the residual vector. These differences lead to very different proof techniques for proving both eventual convergence as well as convergence rate bounds.

Our asynchronous method is closely related to work by Andersen *et al.* which focuses on computing PageRank, and provides an algorithm and analysis which rely on the conditions that G is a nonnegative scaled stochastic matrix, z is entry-wise positive and bounded strictly away from zero, and the solution x is a probability vector (i.e., consisting of nonnegative entries that sum to 1) [?]. In contrast, our algorithm focuses on general linear systems of equations. Moreover, while their analysis proves a linear decrease in the error, we prove that the second moment of our error contracts by a time dependent factor in each iteration, and thus our algorithm converges to the correct solution with a tighter convergence rate. Our algorithm differs from the algorithm presented in Andersen *et al.* by a different choice of termination conditions, and different rules for choosing a coordinate update order, utilizing probabilistic sampling. This not only requires very different analysis, but also allows for the algorithm to be implemented in an asynchronous ditributed manner without coordination between tasks.

The use of randomization in subsampling matrices as part of a subroutine in iterative methods has previously been used in the context of other global matrix algorithms, such as the randomized Kaczmarz method and stochastic iterative projection [?]. The randomized Kaczmarz method is used in the context of solving overdetermined systems of equations, subsampling rows to reduce the dimension of the computation matrix in each iteration. Stochastic iterative methods involve sampling a sparse approximation of matrix G to reduce the computation in each iteration while maintaining convergence. It may be of further interest to combine the method presented in this work with some of these approaches to obtain local algorithms for other settings.

### 1.3Equivalence of Ax=b and x=Gx+z

Given a system of linear equations of the form Ax=b, there are many methods for choosing G and z such that the equation is equivalent to the form given by x=Gx+z with ρ(G)<1 [?]. The Jacobi and Richardson methods are particularly suitable for our setting because they have additional properties that G is as sparse as A, and Gij can be computed as a simple function of Aij and Aii. Given (A,b), there are potentially infinitely many ways to choose (G,z) to satisfy the condition that ρ(G)<1. Finding the optimal choice^{2}

The Jacobi method chooses G=−D−1(A−D) and z=D−1b, where D is a diagonal matrix such that Duu=Auu. The Richardson method chooses G=I−γA and z=γb for any γ such that 0≤γ≤min∥x∥2=1(2xTAx)/(xTATAx). If A is symmetric, then using the Richardson method with an optimal choice of γ results in a choice of G such that ρ(G)=∥G∥2=(κ(A)−1)/(κ(A)+1).

## 2Distributed Computation Model

In the modern world of large scale computation, as the requirement for computational resources and memory storage increases, distributed cloud computing systems have become the norm for computation that involves handling large amounts of data. Since the computing power and memory of any single processor is limited, large distributed file systems (DFS), e.g. HDFS or GFS, collect together many storage disks with a master node which handles I/O requests, allowing clients to access the information in the distributed file system in a similar way of accessing files from the local disk. An algorithm can be parallelized by separating it into small tasks that can each be computed by a single processor through accessing the DFS. The access time of I/O requests to the distributed file system is much longer than accessing files in a processor’s own local memory, so we would like to minimize the number of DFS accesses in addition to the computing resources consumed, i.e. total number of tasks performed.

In this paper, we will assume the computation model as described in Figure 1. There is a large distributed file system, which all the processors have access to. There are a collection of processors (CPUs) each with a small fixed size local memory. One CPU is designated the master CPU, and it manages the task queue as well as the termination and output of the algorithm. The remaining processors are designated worker CPUs, and they perform tasks assigned to them from the task queue. The cost will be counted in terms of the amount of computing resources that the entire computation consumes, e.g. the number of tasks performed, DFS accesses per task, and storage used in the DFS. In many cloud computation systems, the computing resources are shared across many jobs that are running on the cloud, therefore, the tasks queue maybe include tasks corresponding to unrelated jobs as well.

## 3Algorithm Intuition

Given a vector z and matrix G such that ρ(G)<1, our goal is to approximate the ith component of the solution vector x to x=Gx+z. Classic stationary linear iterative methods use updates of the form x(t+1)=Gx(t)+z to iteratively approximate leading terms of the Neumann series. The matrix-vector multiplication can be performed in a distributed manner by splitting it into update tasks, where each task updates a single coordinate u according to

for some u∈{1,2,…n}. These tasks are added to the task queue and assigned to different processors to compute. Since z can be fully dense, the vectors x(t) will be at least as dense as z, thus computing x(t+1) from x(t) involves n individual coordinate update tasks. In our problem, since we are specifically interested in a single component i, we instead define a residual-based update method which maintains sparsity of the intermediate vector involved in the computation. We will first present a synchronous distributed version of the algorithm. In Section 4, we will present the asynchronous distributed implementation of the algorithm, and argue that even when the updates are performed asynchronously, the algorithm still converges to the correct solution. Both implementations require at most O(n+|E|) space in the DFS to store the matrix G, vector z, an any intermediate values involved in the computation.

According to the Neumann series representation of x,

Consider defining the residual vector at iteration k as r(k)=(GT)kei. Observe that the sparsity pattern of (GT)kei is given by Ni(k), the radius k neighborhood of node i. We can rewrite xi as a function of the residual vectors

Let ^x(t)i denote our estimate of xi at iteration t. We can iteratively approximate xi with the low order terms of the Neumann series using the following updates:

and initializing with ^x(0)i=0, and r(0)=ei. Since the sparsity of r(t) is at most the size of the t-radius neighborhood of node i, denoted |Ni(t)|, the computation involved in one iteration can be split into |Ni(t)| single coordinate update tasks, corresponding to updating each coordinate u∈Ni(t). A task updating coordinate u executes the following steps:

ADD Guur(t)u to r(t+1)u,

ADD zur(t)u to ^x(t+1)i,

For each v∈Nu, ADD Guvr(t)u to r(t+1)u.

Each update task uses at most O(|Nu|) DFS accesses, and does not require more than constant space in the local memory. The processor can only store the value of r(t)u, and sequentially access and compute Guvr(t)u for v∈Nu, requiring |Nu| DFS accesses, but only O(1) memory for stored information across computations. We initialize the estimate and residual vectors for the (t+1)th iteration with r(t+1)=0 and ^x(t+1)i=^x(t)i.

The processors still need to pay a synchronization cost due to coordinating the iterations of computation. This results in delays as tasks for a new iteration must wait until every update task from the previous iteration completes. Termination can be determined by imposing a condition on the residual vector which is checked after each iteration of computation, such as terminating when ∥r∥2<ϵ. In Section 5, we will prove convergence rate bounds for the synchronous implementation and discuss the gains the algorithm attains from a coordinate-based computation as opposed to computing the full vector.

## 4Asynchronous Updates

The method described above requires coordination amongst the tasks to track each iteration of the algorithm. This may cause unnecessary delays due to enforcing that the tasks must be completed in a specific order. In this section, we introduce an asynchronous implementation of the algorithm, in which the update tasks may be performed in arbitrary order, and we do not need to wait for previous tasks to complete before beginning to compute a new task. In Section 6, we prove that the algorithm always converges, and we establish convergence rate bounds for different coordinate update rules.

### 4.1Individual Update Tasks

In the asynchronous implementation, since we no longer keep track of any iterations of the algorithm, we will simply store a single instance of the residual vector r in the DFS. When the different tasks update their corresponding coordinates, they will read and write their updates to the residual vector stored on the DFS. The algorithm is initialized in the same way with r=ei and ^xi=0. A task to update residual coordinate u involves three steps:

READ ru, and SET ru to Guuru,

ADD ruzu to ^xi,

For each v∈Nu, ADD Guvru to rv.

The value of ru used in steps 2 and 3 is the original value read from the DFS in step 1. For each task, the worker processor makes O(|Nu|) DFS accesses, and the computation is fairly simple. This is the same computation as the individual tasks in the synchronous implementation, except without keeping track of the residual vector across distinct iterations. As a result, even when the sequence of coordinate updates is the same, the residual vector in the asynchronous implementation will evolve differently. For example, if v∈Nu, and v is updated after u, then in the asynchronous implementation, when the algorithm is updating coordinate v, it will use the updated value in which the previous update from coordinate u added Gvuru to the value of rv. In Section 6.1, we introduce an interpretation of the algorithm as summing weighted walks in the graph. The synchronous implementation always sums the walk in a breadth first manner, such that all walks of length t are summed in the tth iteration, wheras the asynchronous implementation may sum walks of different lengths in a single update.

For the purposes of analyzing the convergence rate bounds, we consider that the three steps involved in a single update task are performed together as a single unit of computation, i.e., that the different steps involved in a single update task are executed together, and do not interleave with other tasks. We let ^x(t)i denote the estimate after t update tasks have completed, and we let r(t) denote the residual vector after t update tasks have completed. This property can be enforced through read and write locks, which would prevent another task from simultaneously changing the value of ru while a particular task is in the middle of computation involving ru. This allows us to clearly track the value of the residual vector after each update task, lending to convergence rate bounds.

However, we will be able to prove asymptotic convergence with much weaker conditions, in which only step 1 of the update task needs to be considered a single unit executed together. Since addition operations are exchangeable, the correctness of the algorithm still holds even when the addition operations in step 2 and 3 of the update task may interleave with other operations on the data from other tasks. Step 1 needs to be executed together because we need to make sure that another task does not add value to ru in between the time that we first read ru and write Guuru, since we would then accidentally override the added value. Alternatively, we would not want another task to read the same value of ru and begin repeating the same update that we have already begun. We will show in Lemma ? that the invariant ^xi−xi=rTx holds before and after any update task.

### 4.2Coordinate Update Rule

In the asynchronous implementation, we are given more flexibility to choose the order in which the coordinates are updated. We could update in the same order as the synchronous implementation, in which we round robin update coordinates according to a breadth first traversal over the graph, i.e., updating first all coordinates in Ni(1), followed by Ni(2). Similarly we can iterate round robin updates for all coordinates with nonzero residual vector values. This can be coordinated by designating one processor as the “master”, whose job is to add tasks to the the task queue.

In settings where we would like to elimination coordination between tasks from a master processr, we can use randomization to generate the tasks or coordinate update order. To approximate the round robin procedure, we could probabilistically choose the next update coordinate by sampling uniformly randomly from all coordinates with nonzero residual values, which we term the ‘uniform censored sampling’ procedure. As each processor finishes a task, it can generate the next task by sampling a new update coordinate. This can be facilitated by storing the value of ∥r∥0 as well as a list of coordinates with nonzero valued residuals, and the update tasks can easily be modified to maintain the value of ∥r∥0 and list of relevant coordinates.

As our algorithm is derived from residual based updates, and the estimation error is given by rTx, this suggests that we may make more progress if we focus on updating coordinates with large residual values. For example, we can choose to always update the coordinate with the largest residual value. This can be implemented by maintaining a priority queue with the residual values. We could also sample a coordinate probabilistically proportional to some function of the residual, e.g., proportional to |ru|, or r2u. This may be more difficult to implement without iterating through the residual vector to generate each sample, though it is still possible to implement in our distributed computation model with a larger number of DFS accesses.

### 4.3Termination

The termination conditions can be chosen depending on the desired accuracy and the value of the residual vector. The error is given by rTx, but since we do not know the value of x, we can design the termination condition as a function of r. For example, terminating when ∥r∥2<ϵ results in an additive error bound of at most ϵ∥x∥2. The individual update tasks can be modified to additionally keep track of ∥r∥22, ∥r∥1, or ∥r∥∞ without incurring much overhead, since these quantities are additive across coordinates, and each update task changes at most d+1 coordinates of r.

We are motivated by network analysis settings in high dimension, such as computing Pagerank or Bonacich centrality when n is large. As n grows to infinity for some large graph, ∥x∥2 is in fact normalized, bounded, and doesn’t scale with n for these three example network centralities. Most of the mass is contained in a few components, implying that an additive error bound of ϵ∥x∥2 approximately guarantees a multiplicative error for large weight nodes, and an additive error for small weight nodes. Therefore, we will present many of our results assuming the algorithm uses a termination condition of ∥r∥2<ϵ.

## 5Synchronous Analysis

In order to compare the convergence rate bounds for the asynchronous implementation, we first analyze the synchronous implementation. We will count the number of tasks and multiplications that the synchronous implementation uses. This analysis will help us to appreciate and identify the gains the algorithm makes due to sparsity and local computation.

Corollary ? follows from the proof of Theorem ?, and highlights that if the graph is sparse, or the size of the neighborhood grows slowly, then the complexity of the algorithm can scale much better than computing the entire solution vector, which would cost O(nln(ϵ)/ln(∥G∥2)) update tasks.

The initial vectors and update rules are chosen such that p(t)=(∑t−1k=0Gk)Tei, and r(t)=(Gt)Tei. Therefore for all t, xi=zTp(t)+xTr(t), and the error in the estimate at iteration t is given by xTr(t)=xT(Gt)Tei. When ρ(G)<1, the error converges to zero, and thus the algorithm converges. It follows that the error is bounded by

When the algorithm terminates at ∥r(t)∥2≤ϵ, the error is bounded by ϵ∥x∥2, and after t iterations, the error is bounded by ∥(GT)tei∥2∥x∥2≤∥G∥t2∥x∥2. Since each row of G has at most d nonzero entries, ∥r(t)∥0≤min(dt,n). The number of coordinate update tasks in each iteration is at most ∥r(t)∥0≤|Ni(k)|≤min(dt,nt). Therefore, we can upper bound the total number of tasks in t iterations by by

Since ∥r(t)∥2 decays as ∥G∥t2, the algorithm terminates at ∥r∥2<ϵ within at most ln(ϵ)/ln(∥G∥2) iterations, upper bounding the tasks performed by

When G is nonsymmetric, ρ(G)<1, yet ∥G∥2≥1, the algorithm still converges asymptotically, though our rate bounds no longer hold. We suspect that in this case, a similar convergence rate holds as a function of ρ(G), due to Gelfand’s spectral radius formula, which states that ρ(M)=limk→∞∥Mk∥1/k2. The precise rate depends on the convergence of this limit.

The right hand expression in the theorem comes from bounding the number of coordinate updates in each iteration by n, which holds even in the nonsparse setting. This bound obtains the same result as standard linear iterative methods. The analysis of our algorithm highlights the improvement of our local algorithm over a general global vector computation. The number of tasks grows as O(∑t−1k=0|Ni(k)|), which for some graphs may be significantly less than O(nt). For a bounded degree graph, ∑tk=0|Ni(k)|=O(dt), which may be much less than O(nt) in the case when d is fixed, and n is very large (recall that t is on the order of ln(ϵ)/ln(∥G∥2)). The bounded degree condition is used in our analysis to cleanly bound |Ni(t)|, however our results naturally extend to other graphs given bounds on |Ni(t)|. For power law graphs, we can use a bound on the growth of the local neighborhood size for average nodes to obtain non-trivial convergence rate results. For graphs in which the size of the neighborhood only grows polynomially, then the local algorithm would gain significant savings over the global algorithm. This results in conditions under which our algorithm achieves an approximation for xi in constant time with respect to the size of the matrix for large n, e.g. d=O(1) and −1/ln(∥G∥2)=O(1).

We can visualize the algorithm in terms of computation over G(G). Multiplying r(t) by GT corresponds to a message passing operation from each of the nonzero coordinates of r(t) along their adjacent edges in the graph. The sparsity of r(t) grows as a breadth first traversal over the graph starting from node i. The termination condition guarantees that the algorithm only involves nodes that are within distance ln(ϵ)/ln(∥G∥2) from the node i. We define the matrix GNi(t) such that GNi(t)(a,b)=G(a,b) if (a,b)∈Ni(t)×Ni(t), and is zero otherwise. It follows that

It is possible that for some choices of i and t, ∥GNi(t)∥2<∥G∥2, in which case the algorithm would converge more quickly as a function of the local neighborhood. If G corresponds to a scaled adjacency matrix of an unweighted undirected graph, then it is known that

In this case, we would only expect ∥GNi(t)∥2 to be smaller than ∥G∥2 if the local degree distribution of the neighborhood around node i is different from the global degree distribution.

## 6Asynchronous Analysis

It is not as straightforward to analyze the asynchronous implementation of the algorithm, since we can no longer write the residual vector as a simple expression of G and the iteration number. However, we can show that each coordinate update task preserves an invariant which relates the estimate ^xi and residual r to the true solution xi.

Recall that x=z+Gx. We prove that the invariant holds by using induction. First verify that before any computation has begun, the invariant is satisfied by the initialized values,

Let rold denote the residual vector before an update task, and let rnew denote the residual vector after an update task. Then a single update task corresponds to the following steps:

Assuming that xi=^xoldi+⟨roldx⟩, it follows that

It follows from Lemma ? that we can choose termination conditions based upon the value of the residual vector which would directly lead to upper bounds on the estimation error. For example, if ∥r∥2≤ϵ, then |^xi−xi|≤ϵ∥x∥2.

### 6.1Counting Weighted Walks

Alternatively, we can take the perspective that the algorithm is computing xi by collecting a sum of weighted walks over the graph G(G) beginning at node i. The estimate ^xi corresponds to the sum of all weighted walks which are already “counted”, and the residual vector represents all yet uncounted walks. As long as step 1 of the coordinate update task is atomic, we can ensure that every walk is accounted for exactly once, either in ^xi, or in the residual vector. Let ~G denote the matrix where ~Gij=|Gij|. Theorem ? uses the perspective of counting weighted walks to show that as long as ρ(~G)<1, the algorithm converges to xi as long as each coordinate is chosen infinitely often, regardless of the sequence in which the updates are performed.

The solution xi can be expressed as a weighted sum over all walks over the graph G(G) beginning at node i, where a walk beginning at node i and ending at node j has weight ∏e∈walkGezj. The updates ensure that we never double count a walk, and all uncounted walks are included in the residual vector r. For any l, there is a finite time Sl after which all random walks of length less than or equal to l have been counted and included into ^xi. This allows us to upper bound |xi−^xi| as a function of ~Gl+1, which converges to zero when ρ(~G)<1. If ρ(~G)≥1, then the original Neumann series stated in is only conditionally convergent. By the Reimann series theorem, the terms can be rearranged in such a way that the new series diverges, and rearranging the terms in the series corresponds to updating the coordinates in different orders, e.g., depth first as opposed to breadth first. This is the same conditions for asymptotic convergence as provided for the asynchronous linear iterative updates, which is also shown to be tight [?]. This theorem and proof can be extended to show that the algorithm converges asymptotically even given communication delays, as long as the messages reach their destination in finite time.

Let Wk(i) denote the set of length k walks beginning from node i, i.e., a sequence of coordinates w=(w0,w1,w2,…wk) such that w0=i and (ws,ws+1)∈E for all s∈{0,1,…k−1}. Then

This shows that due to the form of the weights over each walk, we can express the weighted sum of all walks whose first edge is (i,j) by Gijxj. Similarly, this argument extends recursively to show that the weighted sum of all walks whose first l+1 nodes are given by (v0,v1,v2,…vl) is equivalent to (∏l−1s=0Gvsvs+1)xvl, since xvl captures the sum and weights of the remaining unfinished portion of the walks. In other words, the weighted sum of walks with a certain prefix is equal to the product of the weights from the prefix portion of the walk multiplied with the sum of the weights for all walks that could continue from the endpoint of the prefix walk. The value of the residual vector ru contains the product of the weights along the prefix portion of the walks that end at node u, thus explaining why rTx is equivalent to the weight of all yet uncounted walks.

We use this perspective to interpret the single coordinate updates in the algorithm to argue that there is conservation of computation, i.e., we never double count a walk, and all uncounted walks are included in the residual mass vector r. The update in our algorithm can be interpreted through , as the first term zu captures the walks that end at node u, and each term within the summation Guwxw counts the walks which take the next edge (u,w). Each time that a coordinate u is updated, the step

ADD ruzu to ^xi,

corresponds to counting the walks which end at node u, and whose weight is contained within ru. We also need to count the walks which have the same prefix, but do not yet terminate at node u, which is captured by multiplying ru by each of its adjacent nodes Guw and adding that to rw:

SET ru to Guuru,

For each v∈Nu, ADD Guvru to rv.

We proceed to argue that for any l, there is a finite time t after which all random walks of length less than or equal to l have been counted and included into the estimate ^xi.

Given a coordinate update sequence (u0,u1,u2,u3,…), we require that each coordinate appears infinitely often. We can assume that u0=i, since at iteration 0 all the mass is at node i, thus it is the only node to update. Let S1 denote the earliest time after which all of the neighbors of node i have been updated at least once:

This guarantees that ^x(S1)i includes the weights from all the length one walks from node i. We proceed to let S2 denote the earliest time which all nodes within a 2-neighborhood of node i have updated once after time S1:

This now guarantees that ^x(S2)i includes the weights from all the length one and two walks from node i. We can iteratively define Sr as the time after which the estimate vector has counted and included all walks of length up to r:

Since each coordinate appears infinitely often in the sequence, Sl is well defined and finite for all l.

Finally we upper bound the error by using a loose upper bound on the weights of all walks with length larger than l. By the invariant, it follows that xi=^x(Sl)i+r(Sl)Tx, which is the sum of all counted or included walks in ^x(Sl)i, and the remaining weight of uncounted walks in r(Sl)Tx. The weighted sum of all walks of length at most l from node i is expressed by zT∑lk=0(GT)kei. Thus the error, or weight of uncounted walks, must be bounded by the corresponding weighted sum of the absolute values of the walks of length larger than l:

It follows that

which converges to zero as long as ρ(~G)<1.

In fact our proof for the asymptotic convergence translates directly into a convergence rate bound as well.

Based upon the update sequence we can compute bounds on Sl, or the time after which all walks of length l have definitely been counted. We can analyze the basic coordinate update rule which follows the same pattern as the synchronous implementation, in which we update according to the neighborhoods of i. The update sequence would be given by (Ni(0),Ni(1),Ni(2),…). Although the update order may be the same as the synchronous algorithm, the computation is not the same due to the accumulation of the residual vector across update tasks. It follows that due to the update order, Sl=∑lk=0|Ni(k)|. If the graph is bounded degree with max degree d, such that |Ni(k)|≤dk, then Sl≤dl+1, resulting in the following corollary.

This matches the convergence rate bound for the synchronous algorithm when G is nonnegative, which is reasonable for some applications in which G is derived from network data. It also follows directly from Theorem ? that if every coordinate updates at least once within every B timesteps, then the error decays with rate ∥~G∥t/B2, which is comparable to the bounded delay model and analysis for the asynchronous linear iterative algorithm [?].

### 6.2Probabilistic Update Order

When the coordinates are sampled probabilistically, we can no longer guarantee that a certain set of coordinates are updated within a fixed interval. In this section, we instead provide a probabilistic analysis of the error by analyzing the evolution of the 2-norm of the residual vector in expectation. We will assume that each coordinate update task is atomic, such that if the sequence of coordinate updates is given by (u0,u1,u2,…), the residual vector after t updates will be equivalent to the following computation:

The precise expression depends on the detailed order of updates, and thus the convergence rate may depend upon the rule that the algorithm chooses to determine the order of updating coordinates.

We provide an analysis for ‘uniform censored sampling’, in which coordinates with nonzero valued current residuals are chosen with equal probability, according to

where r(t) denotes the current residual after t updates. We have suppressed the dependence of P on r(t) for simpler notation. Since the distribution P chooses uniformly among the nonzero coordinates of r(t), in expectation the update step corresponds to multiplying a scaled version of matrix G to vector r(t). We can prove that in addition ∥r(t)∥2 contracts with high probability due to the choice of distribution P. With high probability, the number of multiplications the asynchronous algorithm uses is bounded by a similar expression as the bound given for the synchronous algorithm.

Lemma ? states that

Thus, in expectation, the error contracts in each update task by at least (1−(1−∥G∥2)/min(td,n)). We use this to prove an upper bound on the expected L2-norm of the residual vector after t update tasks, and we apply Markov’s inequality to prove that the algorithm terminates with high probability within a certain number of multiplications. There is additional technical detail in the formal proof, as it needs to handle the fact the ∥r(t)∥0 is dependent on the full history of the previous iterations. We first analyze the algorithm for a modified distribution, where the scaling factor grows deterministically according to min(td,n) as opposed to ∥r(t)∥0, and we use a coupling argument to show that the upper bound on the termination time of the algorithm using the modified distribution translates to the original distribution P. This establishes an upper bound on the Lyapunov exponent for a product of random matrices drawn from a time-dependent distribution.

This bound grows exponentially in d, while the corresponding bound in Corollary ? grows only polynomially in d. The rate of convergence of the asynchronous variant is slower by a factor of d/ln(d) because the provable contraction of the error in each iteration is now spread out among the nonzero coordinates of the current iterate.

Since the algorithm terminates when ∥r(t)∥2≤ϵ, it follows from Lemma ? that |^x(t)i−xi|=|r(t)Tx|≤ϵ∥x∥2. Recall that the algorithm chooses a coordinate in each iteration according to the distribution P, as specified in (Equation 3). To simplify the analysis, we introduce another probability distribution ~P, which has a fixed size support of min(td,n) rather than ∥r(t)∥0. We first analyze the convergence of a modified algorithm which samples coordinates according to ~P. Then we translate the results back to the original algorithm.

Observe that for any t∈Z+, there exists a function Ct:Rn→{0,1}n, which satisfies the properties that for any v∈Rn and u=Ct(v), if vi≠0, then ui=1, and if ∥v∥0≤td, then ∥u∥0=min(td,n). In words, Ct(v) is a function which takes a vector of sparsity at most td, and maps it to a binary valued vector which preserves the sparsity pattern of v, yet adds extra entries of 1 in order that the sparsity of the output is exactly min(td,n). We define the distribution ~P to choose uniformly at random among the nonzero coordinates of Ct(r(t)), according to:

where we have suppressed the dependence of ~P on t and r(t) for simpler notation. This is a valid probability distribution since for all t, ∥r(t)∥0≤td. We first analyze the asynchronous algorithm which samples coordinates accoridng to ~P. Lemma ? shows that in expectation, the error contracts in each iteration by (1−(1−∥G∥2)/min(td,n)). Lemma ? provides an upper bound on the expected L2-norm of the residual vector r(t). Then we apply Markov’s inequality to prove that the algorithm terminates with high probability within a certain number of multiplications.

In order to extend the proofs from ~P to P, we define a coupling between two implementations of the algorithm, one which sample coordinates according to ~P, and the other which samples coordinates according to P. We prove that in this joint probability space, the implementation which uses distribution P always terminates in number of iterations less than or equal to the corresponding termination time of the implementation using ~P. Therefore, computing an upper bound on the number of multiplications required under ~P is also an upper bound for the algorithm which uses P.

By Markov’s inequality, P(∥r(t)∥2≥ϵ)≤δ for E~P[∥r(t)∥22]≤δϵ2. Therefore, we can directly apply Lemma ? to show that if ∥G∥2<1, d≥4, and n≥8, the algorithm terminates with probability at least 1−δ for

Since we are concerned with asymptotic performance, the conditions d≥4 and n≥8 are insignificant. To bound the total number of multiplications, we multiply the number of iterations by the maximum degree d.

Finally, we complete the proof by translating the analysis for ~P to P. Let us consider implementation A, which samples coordinates from ~P, and implementation B, which samples coordinates from P. Let RA denote the sequence of residual vectors r(t) derived from implementation A, and let RB denote the sequence of residual vectors r(t) derived from implementation B. The length of the sequence is the number of iterations until the algorithm terminates. We define a joint distribution such that P(RA,RB)=P(RA)P(RB|RA).

Let P(RA) be described by the algorithm sampling coordinates from ~P. The sequence RA can be sampled by separately considering the transitions when non-zero valued coordinates are chosen, and the length of the repeat in between each of these transitions. Given the current iteration t and the sparsity of vector r(t), we can specify the distribution for the number of iterations until the next transition. If we denote τt=min{s:s>t and r(s)≠r(t)}, then

Conditioned on the event that a non-zero valued coordinate is chosen at a particular iteration t, the distribution over the chosen coordinate is the same as P.

For all t, r(t+1)≠r(t) if and only if the algorithm chooses a non-zero valued coordinate of r(t) at iteration t, which according to ~P, occurs with probability 1−∥r(t)∥0/min(td,n). Therefore, given the sequence RA, we can identify in which iterations coordinates with non-zero values were chosen. Let P(RB|RA) be the indicator function which is one only if RB is the subsequence of RA corresponding to the iterations in which a non-zero valued coordinate was chosen.

We can verify that this joint distribution is constructed such that the marginals correctly correspond to the probability of the sequence of residual vectors derived from the respective implementations. For every (RA,RB) such that P(RB|RA)=1, it also follows that |RA|≥|RB|, since RB is a subsequence. For every q,

Therefore, we can conclude that since the probability of the set of realizations such that implementation A terminates within the specified bound is larger than 1−δ, it also follows that implementation B terminates within the specified bound with probability larger than 1−δ. Therefore, since we have proved Theorem ? for implementation A, the result also extends to implementation B, i.e., our original algorithm.

Our convergence rate bounds only apply when the algorithm samples uniformly among nonzero coordinates of the residual vector r(t), according to the distribution P defined by Equation 3. However, this may not be the distribution which optimizes the convergence rate. Choosing a distribution which is not uniform amongst the nonzero coordinates is analogous to multiplying the residual vector by a reweighted matrix ~G in expectation, where each row of ~G corresponds to a row of G weighted by the probability of choosing that row in the new distribution. This is challenging to analyze, as the weights for each row may be dependent upon the entire history of update tasks. Analysis would requires characterizing the Lyapunov exponent of a product of random matrices, where each matrix is sampled from a different distribution, dependent upon the entire history, making it difficult to directly analyze the algorithm in expectation. Under stronger conditions (e.g. only nonnegative entries in G,z), we can show monotonic decrease in the error, and hence bound convergence rates for other sampling distributions with standard techniques.

## 7Simulations

We implemented our algorithm on synthetic data to illustrate the different convergence rates of choosing different coordinate update order rules. Figure ? shows results from computing Bonacich centrality of a node in a Erdos-Renyi graph with 1000 nodes and edge probability 0.0276, which is chosen arbitrarily to guarantee that the graph is sparse yet connected. Figure ? shows results from computing Bonacich centrality of a node in a power law degree graph with 500 nodes, generated according to the configuration model with a power law degree distribution of P(d)∝d−1.5.

We compare the standard linear iterative method, the synchronous implementation of our algorithm, and the asynchronous implementation algorithm with five choices of update rules. Round robin refers to the update rule which follows the expanding neighborhoods of node i, i.e., updating according to the sequence (Ni(0),Ni(1),Ni(2),…). We also implement the uniform censored sampling rule, which samples uniformly amongst nonzero valued residual coordinates. We also explored sampling rules which depend on the value of the residuals, choosing the coordinate proportional to |ru|, r2u, or chosen as argmaxu|ru|.

Our simulations indicate that choosing coordinates with large values of |ru| seems to improve the convergence of the algorithm. In fact the algorithm which always chooses the largest coordinate to update seems to perform the best. This is consistent with our intuition, as the residual vector r is directly related to the estimation error. By updating coordinates with large values of ru, we could make more progress in reducing the error. Establishing theoretical analysis of this observation remains a challenging problem, as the sampling distribution for update task t depends in a complex manner upon the full sample path of updates up to iteration t, as opposed to a simple scaling of the matrix as in uniform sampling. It is not obvious how to establish bounds for a product of random matrices drawn from a complex path dependent distribution.

We also observe that the our algorithm exhibits larger gains in the beginning of the algorithm, but the gains becomes less significant as the algorithm progresses. This could be a result of our algorithm exploiting the small size of the the local neighborhood in the beginning of the algorithm. As the size of the neighborhood grows in include all coordinates, our algorithm no longer enjoys sparse residual vectors, and thus the computational savings slows down.

## 8Future Directions

It is an open problem to investigate the optimal choice of the probability distribution for sampling coordinates used within the asynchronous method, as simulations indicate that some distributions may lead to faster convergence rates than others. This is related to recent work which investigates weighted sampling for the randomized Kaczmarz method, although their methods depend on weighting according to the matrix G rather than the values in the intermediate vectors of the algorithm.

We hope that our work will initiate studies of sparsity-preserving iterative methods in asynchronous and distributed settings from an algorithmic perspective. Our methods could potentially be used as subroutines in other methods in which there is a need for local or asynchronous matrix computation. This is an initial study of local iterative methods for matrix computation, and thus it is unclear whether our algorithm achieves an optimal convergence rate. There is a wealth of literature which studies acceleration or preconditioning techniques for classic linear system solvers, and it would be interesting to see whether these acceleration techniques could be used to speed up the convergence of our local algorithm as a function of the spectral properties of the matrix.

### Footnotes

- O(mlogcnlogϵ−1), where m is the number of nonzero entries in A, and c∈R+ is a fixed constant.
- By optimal, we would like to minimize ρ(G), which maximizes the convergence rate of the algorithm.