1 Introduction
\OneAndAHalfSpacedXI\TheoremsNumberedThrough\ECRepeatTheorems\EquationsNumberedThrough\RUNAUTHOR

Lee, Ozdaglar, and Shah

\RUNTITLE

Asynchronous Approximation of Single Component of Linear System

\TITLE

Asynchronous Approximation of a Single Component
of the Solution to a Linear System

\ARTICLEAUTHORS\AUTHOR

Christina E. Lee, Asuman Ozdaglar, Devavrat Shah \AFFLaboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139,
\EMAILcelee@mit.edu, \EMAILasuman@mit.edu, \EMAILdevavrat@mit.edu

\ABSTRACT

We present a distributed asynchronous algorithm for approximating a single component of the solution to a system of linear equations , where is a positive definite real matrix and . This can equivalently be formulated as solving for in for some and such that the spectral radius of is less than 1. Our algorithm relies on the Neumann series characterization of the component , and is based on residual updates. We analyze our algorithm within the context of a cloud computation model, in which the computation is split into small update tasks performed by small processors with shared access to a distributed file system. We prove a robust asymptotic convergence result when , regardless of the precise order and frequency in which the update tasks are performed, where . We provide convergence rate bounds which depend on the order of update tasks performed, analyzing both deterministic update rules via counting weighted random walks, as well as probabilistic update rules via concentration bounds. The probabilistic analysis requires analyzing the product of random matrices which are drawn from distributions that are time and path dependent. We specifically consider the setting where is large, yet is sparse, e.g., each row has at most nonzero entries. This is motivated by applications in which is derived from the edge structure of an underlying graph. Our results prove that if the local neighborhood of the graph does not grow too quickly as a function of , our algorithm can provide significant reduction in computation cost as opposed to any algorithm which computes the global solution vector . Our algorithm obtains an additive approximation for in constant time with respect to the size of the matrix when and .

\KEYWORDS

linear system of equations, local algorithm, asynchronous randomized algorithm

## 1 Introduction

Imagine that you are a small restaurant owner in a city. You would like to obtain a quantitative estimate of how your popularity and reputation compare to your competitors within a 5 mile radius of you. You may want to compare the significance of the associated websites of your restaurant and other similar restaurants within the webgraph. This can be measured by PageRank, a quantity used by Google to rank search results. PageRank is defined as the solution to , where is the adjacency matrix of the webgraph, is a given parameter, is the vector of all ones, and is the dimension. Alternatively, you may want to compare the social influence of the restaurants’ associated Facebook pages, which can be computed via the Bonacich centrality. Bonacich centrality is defined as the solution to , where is the adjacency matrix of the social network, and is a given parameter. Both PageRank and Bonacich centrality can be formulated as the solution to a system of linear equations, where the dimension is as large as the webpages in the webgraph or the number of Facebook pages, which is an overwhelming computational expense for our hypothetical small restaurant owner. In this paper, we investigate the question: can we obtain estimates of a few coordinates of the solution vector without the expense of approximating the entire solution vector?

We consider approximating the component of the solution to a linear system of equations , where is a positive definite real matrix, and is a vector in . Positive definite matrices include symmetric diagonally dominant matrices, such as the Laplacian, and also our motivating examples of network centralities, PageRank and Bonacich centrality. Note that or may not be symmetric. When is positive definite, there exists a choice of and such that the problem is equivalent to approximating the component of the solution to , and the spectral radius of , denoted , is less than 1. For PageRank, is a constant, bounded by the teleportation probability, independent of the underlying graph. For Bonacich centrality, can be chosen to be less than 1 by a proper choice of the “discount factor” for any graph.

We consider the setting where is large, yet is sparse, i.e., the number of nonzero entries in every row of is at most . This form of sparsity comes from settings in which the matrix is derived from an underlying bounded degree graph. We will also discuss how we can relax this constraint to other sparse graphs in which the size of the local neighborhood does not grow too quickly.

Solving large systems of linear equations is a problem of great interest due to its relevance to a variety of applications across science and engineering, such as solving large scale optimization problems, approximating solutions to partial differential equations, and modeling network centralities. Due to the large scale of these systems, it becomes useful to have an algorithm which can approximate only a few components of the solution without computing over the entire matrix. Such an algorithm would also lead to efficient ranking and comparison methods. As solving a system of linear equations is fundamentally a problem which involves the full matrix, computing a single component of the solution is non-trivial.

In this era of big data, the classic computation model has changed significantly to accomodate for computation which is too large to compute within a single processor’s memory. We will consider a distributed cloud computation model, in which there are many processors with small constant size memory, yet they have access through the cloud to a distributed file system, e.g. GoogleFS or HDFS, which stores the information regarding the entire matrix. Our algorithm will consist of a sequence of small tasks which can be assigned to different processors to compute asynchronously. We will measure the cost of our algorithm via the amount of computational resources consumed, e.g. number of tasks, DFS accesses, and memory consumed.

### 1.1 Contributions

In this paper, we propose and analyze an asynchronous distributed algorithm which provides an estimate for a single component of the solution vector to . Our algorithm relies upon the Neumann series representation of the solution to ,

 xi=eTi∞∑k=0Gkz=zT∞∑k=0(GT)kei, (1)

where denotes the standard basis vector which takes value 1 at coordinate and 0 elsewhere. We can interpret the term to be the weighted sum of all walks of length from node on the underlying graph defined by . Since we focus on approximating only , we can compute the lower order terms of the summation by summing weighted walks within the -radius neighborhood of node , as opposed to the entire graph. This introduces a locality in computation that we can exploit if the neighborhoods of node do not grow too quickly.

Throughout the paper, we will associate a graph to the matrix , and we will provide our analysis as a function of various properties of the graph. Let denote the directed graph where , and if and only if . Each coordinate of vector corresponds to a node in . Let denote the nodes within a neighborhood of radius around node , i.e. if there exists a path from to of length at most . We will denote the immediate neighbors of node by , i.e., if . The sparsity assumption on means that for all .

Our algorithm is an iterative residual based method in which every task corresponds to updating one coordinate of the residual vector. Expression (1) can be rearranged as

 xi=zTt−1∑k=0r(k)+xTr(t),

where are residual vectors which can be computed iteratively, and the estimate is computed as . The synchronous implementation of the algorithm updates the estimate by adding the value of the residual vector in each iteration and updating the residual by multiplying by .

The asynchronous implementation of the algorithm updates one coordinate of the residual vector at a time. For example updating coordinate corresponds to adding to , and multiplying by the row of and adding that to the residual vector . These updates can be interpreted as accumulating weights of walks over the graph, beginning with short length walks. We will prove that every update task maintains an invariant

 xi=^xi+rTx,

where denotes the estimate, and denotes the residual vector. The invariant property characterizes the error at every iteration, and allows us to prove that the algorithm always converges when , where . The convergence result holds regardless of the order in which coordinates are updated, as long as each coordinate is updated infinitely often, and it is robust to asynchronous updates in which the computation corresponding to different tasks may interweave in the order that they update the residual vector in the DFS. The conditions are given in terms of matrix rather than , since the asynchronous implementation may sum walks of different lengths simultaneously. We use to obtain a worst case bound on the total contribution of any set of walks of length longer than some . We do not require uniform bounds on communication delays or clock rates, as often needed for similar results in asynchronous computation (see Bertsekas and Tsitsiklis (1989)).

Since the error is directly related to the residual vector, we can easily design termination conditions that guarantee upper bounds on the estimation error. We propose to terminate when , which guarantees that . The sparsity pattern of the residual vector will grow according to an expanding local neighborhood around node in the graph defined by , allowing us to upper bound the number of update tasks needed by the computation as a function of the size of this neighborhood.

Our algorithm requires space in the distributed file system, and a single update task for requires DFS accesses, where is the coordinate being updated. The convergence rate of our algorithm can be analyzed via the evolution of the residual vector , which is a function of the particular order, or sequence, of tasks in which the coordinates are updated. We analyze a few different implementations of our algorithm, corresponding to variants on how to choose the order for updating coordinates of the residual vector. We provide our bounds as a function of the maximum degree of the graph, denoted by , but we can extend the results to other graphs in which we have an upper bound for how the size of the local neighborhood grows.

As a baseline, we compute the cost of a synchronous distributed implementation in which the tasks coordinate between iterations to update the residual vector according to

 r←Gr,

which involves individual coordinate update tasks. We prove that the synchronous implementation attains error less than with at most

 O(min(ϵln(d)/ln(∥G∥2),nln(ϵ)/ln(∥G∥2)))

update tasks. This calculation assumes that the tasks coordinate to keep track of the residual vector in each iteration, which requires extra coordination cost that could lead to delays.

We analyze the asynchronous implementation in which the update tasks do not coordinate different iterations of computation, but rather update the same residual vector in the DFS. Rather than multiplying by matrix in each iteration, every individual update task corresponds to applying a local update involving a single row of the matrix . When the coordinates update sequentially in the order imposed by the expanding local neighborhoods of node , the convergence rate is very similar to the synchronous implementation, requiring at most

 O(min(ϵln(d)/ln(∥~G∥2),nln(ϵ)/ln(∥~G∥2)))

update tasks until the error is less than . This update rule ensures that first all coordinates in are updated, followed by all coordinates in , where the coordinates within the same neighborhood can be updated in any order. The order of updates can be coordinated by a designated master processor which manages a shared task queue for all other processors. This update order ensures that short walks get counted in the estimate earlier. The bound depends on due to using a worst case upper bound for the weight of all walks of length longer than a certain value. Compared to the synchronous implementation, this analysis is weaker when may have positive or negative entries, since .

We can alternatively employ randomness to sample the next coordinate to update, enabling every processor to generate the next update task without any coordination cost among other tasks. The algorithm adaptively samples the next coordinate to update according to a distribution which depends on the current residual vector. When the sequence of coordinate updates are sampled uniformly amongst coordinates with nonzero residual values, we can guarantee that with probability at least , the error contracts by a time varying factor in each step, and the algorithm involves at most

 O(min((ϵ√δ/5)−d/(1−∥G∥2),−nln(ϵ√δ)/(1−∥G∥2)))

update tasks until the error is less than . We term this ‘uniform censored sampling’, since we censored the coordinates according to whether the residual value is nonzero, and we sample uniformly otherwise. Establishing the convergence rate requires bounding the Lyapunov exponent for a product of random matrices drawn from time and path dependent distributions. This is inherently different from previous analyses of randomized coordinate updates, which sample from a history independent distribution. We developed a new analysis for ‘uniform censored sampling’ updates.

We can compare with the bounds for the synchronous implementation by considering that when . The randomized update implementation scales exponentially with , whereas the other two bounds only scale polynomially with . The gap is due to the fact that the synchronous and deterministic asynchronous implementations update in an order which ensures that short walks are counted earlier. Intuitively, we expect that the weight of the walks decays exponentially due to the weight being a product over values in which converge eventually to zero. Therefore, by sampling uniformly amongst all coordinates with nonzero residuals, the algorithm may choose to update coordinates which are farther away from node before it finishes updating coordinates within a closer neighborhood of . As a result of a single update task, the contributions added in the process corresponding to updates of the residuals along neighbors will will be approximately “exponentially less significant”, and yet the coordinates still carry an equal weight in determining the next coordinate to update. This leads to the exponentially slower convergence as a function of . This can be remedied by emphasizing coordinates with larger residuals, which we explore heuristically through simulations.

The right hand expressions within the convergence rate bounds across the different implementations are essentially the same, and provide a comparison of our algorithm to standard linear iterative methods, which also converge at the same rate. The left hand expressions provide a local analysis utilizing the sparsity of . They show that the number of tasks required by our algorithm to reach a specified precision is constant with respect to as long as and . The analysis shows that as long as the local neighborhood does not grow too quickly, i.e., the network is large and sparse enough, and the spectral properties are well behaved, i.e., is bounded away from 1, there is a such that for all , our algorithm obtains an estimate of with fewer computational tasks than any centralized algorithm, by the simple fact that the required tasks of our algorithm is upper bounded by an expression which is independent of , and any centralized algorithm will scale at least as the size of the solution vector.

### 1.2 Related Work

There are few existing methods which have explored single component approximations of the solution vector. Standard techniques such as Gaussian elimination, factorization or decomposition, gradient methods, and linear iterative methods all output the full solution vector, and thus the computation involves all coordinates and all entries in the matrix Westlake (1968), Golub and Van Loan (2013). There are nearly linear time1 approximation algorithms for sparse and symmetric diagonally dominant matrices , however they involve global structures over the graph, such as graph sparsifiers or spanning trees Spielman and Teng (2006), Koutis et al. (2011), Kelner et al. (2013).

The Ulam von Neumann algorithm is a Monte Carlo method which obtains an estimate for by sampling random walks. It interprets the Neumann series representation of the solution as a sum over weighted walks on , and obtains an estimate by sampling random walks starting from node over and appropriately reweighting to obtain an unbiased estimator Forsythe and Leibler (1950), Wasow (1952), Curtiss (1954). The challenge is to control the variance of this estimator, which could be large. The classic choice for the sampling matrix requires , though there are modifications which propose other sampling matrices or use correlated sampling to reduce the variance Halton (1970, 1994). The scope of this algorithm is limited, as Ji, Mascagni, and Li proved that there is a class of matrices such that , , and there does not exist any sampling matrix such that the variance of the corresponding estimator is finite Ji et al. (2013). In contrast, our algorithm exploits the sparsity of and provides a convergent solution when and convergence rates when . Single component approximation of the leading eigenvector for a stochastic matrix has been studied previously using Monte Carlo random walk sampling methods as well Lee et al. (2013).

Our work is most related to stationary linear iterative methods, which use updates of the form to recursively approximate leading terms of the Neumann series. The error after iterations is given by , thus the number of iterations to achieve , is given by . For any , will be at least as dense as , and there is no reason to assume that is sparse. Thus a single update step could cost multiplications. These methods do not exploit the sparsity of and the locality of computing a single component.

Bertsekas and Tsitsiklis analyzed the asynchronous implementation of stationary linear iterative methods for solving for the full vector , where they assign each of processor to compute updates corresponding to a specific coordinate. They use a different computation model involving a network of distributed processors computing simultaneously, whereas our model involves a shared global memory and variable number of processors computing in parallel. The cost of our algorithm is considered in terms of computational resources consumed, i.e., the number of tasks and DFS accesses, whereas they consider the number of parallel computations until convergence, where each of the processors are computing at every time step. Our algorithm relies on residual based updates, maintaining an invariant that allows us to precisely characterize the error as a function of the residual vector. These differences lead to very different proof techniques for proving both eventual convergence as well as convergence rate bounds.

Our asynchronous method is closely related to work by Andersen et al. which focuses on computing PageRank, and provides an algorithm and analysis which rely on the conditions that is a nonnegative scaled stochastic matrix, is entry-wise positive and bounded strictly away from zero, and the solution is a probability vector (i.e., consisting of nonnegative entries that sum to 1) Andersen et al. (2007). In contrast, our algorithm focuses on general linear systems of equations. Moreover, while their analysis proves a linear decrease in the error, we prove that the second moment of our error contracts by a time dependent factor in each iteration, and thus our algorithm converges to the correct solution with a tighter convergence rate. Our algorithm differs from the algorithm presented in Andersen et al. by a different choice of termination conditions, and different rules for choosing a coordinate update order, utilizing probabilistic sampling. This not only requires very different analysis, but also allows for the algorithm to be implemented in an asynchronous ditributed manner without coordination between tasks.

The use of randomization in subsampling matrices as part of a subroutine in iterative methods has previously been used in the context of other global matrix algorithms, such as the randomized Kaczmarz method and stochastic iterative projection Strohmer and Vershynin (2009), Sabelfeld and Loshchina (2010), Sabelfeld (2011), Sabelfeld and Mozartova (2009), Wang and Bertsekas (2013). The randomized Kaczmarz method is used in the context of solving overdetermined systems of equations, subsampling rows to reduce the dimension of the computation matrix in each iteration. Stochastic iterative methods involve sampling a sparse approximation of matrix to reduce the computation in each iteration while maintaining convergence. It may be of further interest to combine the method presented in this work with some of these approaches to obtain local algorithms for other settings.

### 1.3 Equivalence of Ax=b and x=Gx+z

Given a system of linear equations of the form , there are many methods for choosing and such that the equation is equivalent to the form given by with Westlake (1968). The Jacobi and Richardson methods are particularly suitable for our setting because they have additional properties that is as sparse as , and can be computed as a simple function of and . Given , there are potentially infinitely many ways to choose to satisfy the condition that . Finding the optimal choice2 of given is beyond the scope of this paper.

{corollary}

If is positive definite or diagonally dominant, then we can use standard methods (e.g. Jacobi or Richardson), to choose a matrix and vector such that , and the solution which satisfies will also satisfy .

The Jacobi method chooses and , where is a diagonal matrix such that . The Richardson method chooses and for any such that . If is symmetric, then using the Richardson method with an optimal choice of results in a choice of such that .

## 2 Distributed Computation Model

In the modern world of large scale computation, as the requirement for computational resources and memory storage increases, distributed cloud computing systems have become the norm for computation that involves handling large amounts of data. Since the computing power and memory of any single processor is limited, large distributed file systems (DFS), e.g. HDFS or GFS, collect together many storage disks with a master node which handles I/O requests, allowing clients to access the information in the distributed file system in a similar way of accessing files from the local disk. An algorithm can be parallelized by separating it into small tasks that can each be computed by a single processor through accessing the DFS. The access time of I/O requests to the distributed file system is much longer than accessing files in a processor’s own local memory, so we would like to minimize the number of DFS accesses in addition to the computing resources consumed, i.e. total number of tasks performed.

In this paper, we will assume the computation model as described in Figure 1. There is a large distributed file system, which all the processors have access to. There are a collection of processors (CPUs) each with a small fixed size local memory. One CPU is designated the master CPU, and it manages the task queue as well as the termination and output of the algorithm. The remaining processors are designated worker CPUs, and they perform tasks assigned to them from the task queue. The cost will be counted in terms of the amount of computing resources that the entire computation consumes, e.g. the number of tasks performed, DFS accesses per task, and storage used in the DFS. In many cloud computation systems, the computing resources are shared across many jobs that are running on the cloud, therefore, the tasks queue maybe include tasks corresponding to unrelated jobs as well.

## 3 Algorithm Intuition

Given a vector and matrix such that , our goal is to approximate the component of the solution vector to . Classic stationary linear iterative methods use updates of the form to iteratively approximate leading terms of the Neumann series. The matrix-vector multiplication can be performed in a distributed manner by splitting it into update tasks, where each task updates a single coordinate according to

 x(t+1)u←∑v∈NuGuvx(t)v+zu

for some . These tasks are added to the task queue and assigned to different processors to compute. Since can be fully dense, the vectors will be at least as dense as , thus computing from involves individual coordinate update tasks. In our problem, since we are specifically interested in a single component , we instead define a residual-based update method which maintains sparsity of the intermediate vector involved in the computation. We will first present a synchronous distributed version of the algorithm. In section 4, we will present the asynchronous distributed implementation of the algorithm, and argue that even when the updates are performed asynchronously, the algorithm still converges to the correct solution. Both implementations require at most space in the DFS to store the matrix , vector , an any intermediate values involved in the computation.

According to the Neumann series representation of ,

 xi=eTi∞∑k=0Gkz=zT∞∑k=0(GT)kei. (2)

Consider defining the residual vector at iteration as . Observe that the sparsity pattern of is given by , the radius neighborhood of node . We can rewrite as a function of the residual vectors

 xi=zTt−1∑k=0r(k)+xTr(t).

Let denote our estimate of at iteration . We can iteratively approximate with the low order terms of the Neumann series using the following updates:

 ^x(t+1)i ←^x(t)i+zTr(t), (3) r(t+1) ←GTr(t), (4)

and initializing with , and . Since the sparsity of is at most the size of the -radius neighborhood of node , denoted , the computation involved in one iteration can be split into single coordinate update tasks, corresponding to updating each coordinate . A task updating coordinate executes the following steps:

3. For each , ADD to .

Each update task uses at most O() DFS accesses, and does not require more than constant space in the local memory. The processor can only store the value of , and sequentially access and compute for , requiring DFS accesses, but only O(1) memory for stored information across computations. We initialize the estimate and residual vectors for the iteration with and .

The processors still need to pay a synchronization cost due to coordinating the iterations of computation. This results in delays as tasks for a new iteration must wait until every update task from the previous iteration completes. Termination can be determined by imposing a condition on the residual vector which is checked after each iteration of computation, such as terminating when . In section 5, we will prove convergence rate bounds for the synchronous implementation and discuss the gains the algorithm attains from a coordinate-based computation as opposed to computing the full vector.

The method described above requires coordination amongst the tasks to track each iteration of the algorithm. This may cause unnecessary delays due to enforcing that the tasks must be completed in a specific order. In this section, we introduce an asynchronous implementation of the algorithm, in which the update tasks may be performed in arbitrary order, and we do not need to wait for previous tasks to complete before beginning to compute a new task. In section 6, we prove that the algorithm always converges, and we establish convergence rate bounds for different coordinate update rules.

In the asynchronous implementation, since we no longer keep track of any iterations of the algorithm, we will simply store a single instance of the residual vector in the DFS. When the different tasks update their corresponding coordinates, they will read and write their updates to the residual vector stored on the DFS. The algorithm is initialized in the same way with and . A task to update residual coordinate involves three steps:

1. READ , and SET to ,

3. For each , ADD to .

The value of used in steps 2 and 3 is the original value read from the DFS in step 1. For each task, the worker processor makes O() DFS accesses, and the computation is fairly simple. This is the same computation as the individual tasks in the synchronous implementation, except without keeping track of the residual vector across distinct iterations. As a result, even when the sequence of coordinate updates is the same, the residual vector in the asynchronous implementation will evolve differently. For example, if , and is updated after , then in the asynchronous implementation, when the algorithm is updating coordinate , it will use the updated value in which the previous update from coordinate added to the value of . In section 6.1, we introduce an interpretation of the algorithm as summing weighted walks in the graph. The synchronous implementation always sums the walk in a breadth first manner, such that all walks of length are summed in the iteration, wheras the asynchronous implementation may sum walks of different lengths in a single update.

For the purposes of analyzing the convergence rate bounds, we consider that the three steps involved in a single update task are performed together as a single unit of computation, i.e., that the different steps involved in a single update task are executed together, and do not interleave with other tasks. We let denote the estimate after update tasks have completed, and we let denote the residual vector after update tasks have completed. This property can be enforced through read and write locks, which would prevent another task from simultaneously changing the value of while a particular task is in the middle of computation involving . This allows us to clearly track the value of the residual vector after each update task, lending to convergence rate bounds.

### 4.2 Coordinate Update Rule

In the asynchronous implementation, we are given more flexibility to choose the order in which the coordinates are updated. We could update in the same order as the synchronous implementation, in which we round robin update coordinates according to a breadth first traversal over the graph, i.e., updating first all coordinates in , followed by . Similarly we can iterate round robin updates for all coordinates with nonzero residual vector values. This can be coordinated by designating one processor as the “master”, whose job is to add tasks to the the task queue.

In settings where we would like to elimination coordination between tasks from a master processr, we can use randomization to generate the tasks or coordinate update order. To approximate the round robin procedure, we could probabilistically choose the next update coordinate by sampling uniformly randomly from all coordinates with nonzero residual values, which we term the ‘uniform censored sampling’ procedure. As each processor finishes a task, it can generate the next task by sampling a new update coordinate. This can be facilitated by storing the value of as well as a list of coordinates with nonzero valued residuals, and the update tasks can easily be modified to maintain the value of and list of relevant coordinates.

As our algorithm is derived from residual based updates, and the estimation error is given by , this suggests that we may make more progress if we focus on updating coordinates with large residual values. For example, we can choose to always update the coordinate with the largest residual value. This can be implemented by maintaining a priority queue with the residual values. We could also sample a coordinate probabilistically proportional to some function of the residual, e.g., proportional to , or . This may be more difficult to implement without iterating through the residual vector to generate each sample, though it is still possible to implement in our distributed computation model with a larger number of DFS accesses.

### 4.3 Termination

The termination conditions can be chosen depending on the desired accuracy and the value of the residual vector. The error is given by , but since we do not know the value of , we can design the termination condition as a function of . For example, terminating when results in an additive error bound of at most . The individual update tasks can be modified to additionally keep track of , , or without incurring much overhead, since these quantities are additive across coordinates, and each update task changes at most coordinates of .

We are motivated by network analysis settings in high dimension, such as computing Pagerank or Bonacich centrality when is large. As grows to infinity for some large graph, is in fact normalized, bounded, and doesn’t scale with for these three example network centralities. Most of the mass is contained in a few components, implying that an additive error bound of approximately guarantees a multiplicative error for large weight nodes, and an additive error for small weight nodes. Therefore, we will present many of our results assuming the algorithm uses a termination condition of .

## 5 Synchronous Analysis

In order to compare the convergence rate bounds for the asynchronous implementation, we first analyze the synchronous implementation. We will count the number of tasks and multiplications that the synchronous implementation uses. This analysis will help us to appreciate and identify the gains the algorithm makes due to sparsity and local computation.

{theorem}

If , the synchronous implementation of the algorithm converges, and estimation error decays as

 |^x(t)i−xi|=r(t)x≤∥G∥t2∥x∥2.

The total number of update tasks the algorithm performs in iterations is

 O(t−1∑k=0|Ni(k)|)=O(min(dt,nt)),

where is the set of nodes which are within a -radius neighborhood of node , and .

Corollary 5 follows from the proof of Theorem 5, and highlights that if the graph is sparse, or the size of the neighborhood grows slowly, then the complexity of the algorithm can scale much better than computing the entire solution vector, which would cost update tasks.

{corollary}

If we terminate the algorithm when
, then , and the total number of update tasks performed is bounded by

 O(min(ϵln(d)/ln(∥G∥2),nln(ϵ)ln(∥G∥2))).
###### Proof.

Proof of Theorem 5. The initial vectors and update rules are chosen such that , and . Therefore for all , , and the error in the estimate at iteration is given by . When , the error converges to zero, and thus the algorithm converges. It follows that the error is bounded by

 |^xi−xi| =|r(t)Tx|≤∥r(t)∥2∥x∥2. (5)

When the algorithm terminates at , the error is bounded by , and after iterations, the error is bounded by . Since each row of has at most nonzero entries, . The number of coordinate update tasks in each iteration is at most Therefore, we can upper bound the total number of tasks in iterations by by

 O(t−1∑k=0|Ni(k)|)=O(min(dt,nt)).

Since decays as , the algorithm terminates at within at most iterations, upper bounding the tasks performed by

 O(min(ϵln(d)/ln(∥G∥2),nln(ϵ)ln(∥G∥2))).
\halmos

When is nonsymmetric, , yet , the algorithm still converges asymptotically, though our rate bounds no longer hold. We suspect that in this case, a similar convergence rate holds as a function of , due to Gelfand’s spectral radius formula, which states that . The precise rate depends on the convergence of this limit.

The right hand expression in the theorem comes from bounding the number of coordinate updates in each iteration by , which holds even in the nonsparse setting. This bound obtains the same result as standard linear iterative methods. The analysis of our algorithm highlights the improvement of our local algorithm over a general global vector computation. The number of tasks grows as , which for some graphs may be significantly less than . For a bounded degree graph, which may be much less than in the case when is fixed, and is very large (recall that is on the order of ). The bounded degree condition is used in our analysis to cleanly bound , however our results naturally extend to other graphs given bounds on . For power law graphs, we can use a bound on the growth of the local neighborhood size for average nodes to obtain non-trivial convergence rate results. For graphs in which the size of the neighborhood only grows polynomially, then the local algorithm would gain significant savings over the global algorithm. This results in conditions under which our algorithm achieves an approximation for in constant time with respect to the size of the matrix for large , e.g. and .

We can visualize the algorithm in terms of computation over . Multiplying by corresponds to a message passing operation from each of the nonzero coordinates of along their adjacent edges in the graph. The sparsity of grows as a breadth first traversal over the graph starting from node . The termination condition guarantees that the algorithm only involves nodes that are within distance from the node . We define the matrix such that if , and is zero otherwise. It follows that

 ∥r(t)∥2 =∥eTiGt∥2=∥eTit∏k=1GNi(k)∥2 (6) =∥eTiGtNi(t)∥2≤∥GNi(t)∥t2. (7)

It is possible that for some choices of and , , in which case the algorithm would converge more quickly as a function of the local neighborhood. If corresponds to a scaled adjacency matrix of an unweighted undirected graph, then it is known that

 max(daverage,√dmax)≤ρ(G)≤dmax. (8)

In this case, we would only expect to be smaller than if the local degree distribution of the neighborhood around node is different from the global degree distribution.

## 6 Asynchronous Analysis

It is not as straightforward to analyze the asynchronous implementation of the algorithm, since we can no longer write the residual vector as a simple expression of and the iteration number. However, we can show that each coordinate update task preserves an invariant which relates the estimate and residual to the true solution .

{lemma}

[Invariant] The update tasks in the asynchronous implementation maintain the invariant that for all , .

###### Proof.

Proof of Lemma 6. Recall that . We prove that the invariant holds by using induction. First verify that before any computation has begun, the invariant is satisfied by the initialized values,

 ^xi+rTx=0+eTix=xi.

Let denote the residual vector before an update task, and let denote the residual vector after an update task. Then a single update task corresponds to the following steps:

 ^xnewi =^xoldi+rolduzu, rnewu =Guuroldu, rnewv =roldv+Guvroldu,∀v∈Nu.

Assuming that , it follows that

 ^xnewi+⟨rnewx⟩−^xoldi−⟨roldx⟩ =rolduzu+xu(Guu−1)roldu+∑v∈NuxvGuvroldu, =eTu(z+Gx−x)roldu=0.
\halmos

It follows from Lemma 6 that we can choose termination conditions based upon the value of the residual vector which would directly lead to upper bounds on the estimation error. For example, if , then .

### 6.1 Counting Weighted Walks

Alternatively, we can take the perspective that the algorithm is computing by collecting a sum of weighted walks over the graph beginning at node . The estimate corresponds to the sum of all weighted walks which are already “counted”, and the residual vector represents all yet uncounted walks. As long as step 1 of the coordinate update task is atomic, we can ensure that every walk is accounted for exactly once, either in , or in the residual vector. Let denote the matrix where . Theorem 6.1 uses the perspective of counting weighted walks to show that as long as , the algorithm converges to as long as each coordinate is chosen infinitely often, regardless of the sequence in which the updates are performed.

{theorem}

If , the estimate from the asynchronous implementation of our algorithm converges to for any sequence of coordinate updates, as long as each coordinate is updated infinitely often.

The solution can be expressed as a weighted sum over all walks over the graph beginning at node , where a walk beginning at node and ending at node has weight . The updates ensure that we never double count a walk, and all uncounted walks are included in the residual vector . For any , there is a finite time after which all random walks of length less than or equal to have been counted and included into . This allows us to upper bound as a function of , which converges to zero when . If , then the original Neumann series stated in (1) is only conditionally convergent. By the Reimann series theorem, the terms can be rearranged in such a way that the new series diverges, and rearranging the terms in the series corresponds to updating the coordinates in different orders, e.g., depth first as opposed to breadth first. This is the same conditions for asymptotic convergence as provided for the asynchronous linear iterative updates, which is also shown to be tight Bertsekas and Tsitsiklis (1989). This theorem and proof can be extended to show that the algorithm converges asymptotically even given communication delays, as long as the messages reach their destination in finite time.

###### Proof.

Proof of Theorem 6.1. Let denote the set of length walks beginning from node , i.e., a sequence of coordinates such that and for all . Then

 xi =eTi∞∑k=0Gkz=∞∑k=0∑w∈Wk(i)k−1∏s=0Gwsws+1zwk, =zi+∞∑k=1∑w∈Wk(i)k−1∏s=0Gwsws+1zwk, =zi+∑j∈NiGij∞∑k=0∑w∈Wk(j)k−1∏s=0Gwsws+1zwk, =zi+∑j∈NiGijxj. (9)

This shows that due to the form of the weights over each walk, we can express the weighted sum of all walks whose first edge is by . Similarly, this argument extends recursively to show that the weighted sum of all walks whose first nodes are given by is equivalent to since captures the sum and weights of the remaining unfinished portion of the walks. In other words, the weighted sum of walks with a certain prefix is equal to the product of the weights from the prefix portion of the walk multiplied with the sum of the weights for all walks that could continue from the endpoint of the prefix walk. The value of the residual vector contains the product of the weights along the prefix portion of the walks that end at node , thus explaining why is equivalent to the weight of all yet uncounted walks.

We use this perspective to interpret the single coordinate updates in the algorithm to argue that there is conservation of computation, i.e., we never double count a walk, and all uncounted walks are included in the residual mass vector . The update in our algorithm can be interpreted through (9), as the first term captures the walks that end at node , and each term within the summation counts the walks which take the next edge . Each time that a coordinate is updated, the step

corresponds to counting the walks which end at node , and whose weight is contained within . We also need to count the walks which have the same prefix, but do not yet terminate at node , which is captured by multiplying by each of its adjacent nodes and adding that to :

1. SET to ,

2. For each , ADD to .

We proceed to argue that for any , there is a finite time after which all random walks of length less than or equal to have been counted and included into the estimate .

Given a coordinate update sequence , we require that each coordinate appears infinitely often. We can assume that , since at iteration 0 all the mass is at node , thus it is the only node to update. Let denote the earliest time after which all of the neighbors of node have been updated at least once:

 S1=min{t≥1:Ni(1)⊂{u0,u1,…ut−1}}.

This guarantees that includes the weights from all the length one walks from node . We proceed to let denote the earliest time which all nodes within a 2-neighborhood of node have updated once after time :

 S2=min{t≥S1:Ni(2)⊂{uS1,uS1+1,…ut−1}}.

This now guarantees that includes the weights from all the length one and two walks from node . We can iteratively define as the time after which the estimate vector has counted and included all walks of length up to :

 Sl=min{t≥Sl−1:Ni(l)⊂{uSl−1,uSl−1+1,…ut−1}}.

Since each coordinate appears infinitely often in the sequence, is well defined and finite for all .

Finally we upper bound the error by using a loose upper bound on the weights of all walks with length larger than . By the invariant, it follows that , which is the sum of all counted or included walks in , and the remaining weight of uncounted walks in . The weighted sum of all walks of length at most from node is expressed by . Thus the error, or weight of uncounted walks, must be bounded by the corresponding weighted sum of the absolute values of the walks of length larger than :

 −zT∞∑k=l+1(~GT)kei≤r(Sl)Tx≤zT∞∑k=l+1(~GT)kei.

It follows that

 ∣∣xi−^x(Sl)i∣∣≤zT∞∑k=0(~GT)k(~GT)l+1ei=xT(~GT)l+1ei,

which converges to zero as long as . \halmos

In fact our proof for the asymptotic convergence translates directly into a convergence rate bound as well.

{theorem}

Suppose the asynchronous implementation of our algorithm used the coordinate update sequence , where each coordinate updates infinitely often. Define as the time after which the estimate vector has counted and included all walks of length up to :

 Sl=min{t≥Sl−1:Ni(l)⊂{uSl−1,uSl−1+1,…ut−1}}.

Then the estimation error of the the algorithm after updates is bounded by

 ∣∣xi−^x(Sl)i∣∣≤xT(~GT)l+1ei≤∥~G∥l+12∥x∥2.

Based upon the update sequence we can compute bounds on , or the time after which all walks of length have definitely been counted. We can analyze the basic coordinate update rule which follows the same pattern as the synchronous implementation, in which we update according to the neighborhoods of . The update sequence would be given by . Although the update order may be the same as the synchronous algorithm, the computation is not the same due to the accumulation of the residual vector across update tasks. It follows that due to the update order, If the graph is bounded degree with max degree , such that , then , resulting in the following corollary.

{corollary}

Suppose the asynchronous implementation of our algorithm used the coordinate update sequence . Then the estimation error of the algorithm after update tasks is bounded by

 |xi−^xi|≤∥~G∥l+12∥x∥2.

It follows that the error is less than for .

This matches the convergence rate bound for the synchronous algorithm when is nonnegative, which is reasonable for some applications in which is derived from network data. It also follows directly from Theorem 6.1 that if every coordinate updates at least once within every timesteps, then the error decays with rate , which is comparable to the bounded delay model and analysis for the asynchronous linear iterative algorithm Bertsekas and Tsitsiklis (1989).

{corollary}

Suppose the asynchronous implementation of our algorithm used a coordinate update sequence in which for some . Then the estimation error of the the algorithm after updates is bounded by

 |xi−^xi|≤∥~G∥l+12∥x∥2.

It follows that the error is less than for .

### 6.2 Probabilistic Update Order

When the coordinates are sampled probabilistically, we can no longer guarantee that a certain set of coordinates are updated within a fixed interval. In this section, we instead provide a probabilistic analysis of the error by analyzing the evolution of the 2-norm of the residual vector in expectation. We will assume that each coordinate update task is atomic, such that if the sequence of coordinate updates is given by , the residual vector after updates will be equivalent to the following computation:

 r=(t−1∏s=0(I−euseTus(I−G)))Tei.

The precise expression depends on the detailed order of updates, and thus the convergence rate may depend upon the rule that the algorithm chooses to determine the order of updating coordinates.

We provide an analysis for ‘uniform censored sampling’, in which coordinates with nonzero valued current residuals are chosen with equal probability, according to

 P(u)=I(r(t)u≠0)∥r(t)∥0, (10)

where denotes the current residual after updates. We have suppressed the dependence of on for simpler notation. Since the distribution chooses uniformly among the nonzero coordinates of , in expectation the update step corresponds to multiplying a scaled version of matrix to vector . We can prove that in addition contracts with high probability due to the choice of distribution . With high probability, the number of multiplications the asynchronous algorithm uses is bounded by a similar expression as the bound given for the synchronous algorithm.

{theorem}

If , with probability 1, the asynchronous implementation which updates coordinates according to eventually terminates at and produces an estimate such that . With probability greater than , the total number of update tasks bounded by

 O(min((ϵ√δ/2)−d/(1−∥G∥2),−nln(ϵ√δ)1−∥G∥2)).

Lemma 9 states that

 EP[r(t+1)∣∣r(t)] =(I−(I−GT∥r(t)∥0))r(t). (11)

Thus, in expectation, the error contracts in each update task by at least . We use this to prove an upper bound on the expected L2-norm of the residual vector after update tasks, and we apply Markov’s inequality to prove that the algorithm terminates with high probability within a certain number of multiplications. There is additional technical detail in the formal proof, as it needs to handle the fact the is dependent on the full history of the previous iterations. We first analyze the algorithm for a modified distribution, where the scaling factor grows deterministically according to as opposed to , and we use a coupling argument to show that the upper bound on the termination time of the algorithm using the modified distribution translates to the original distribution . This establishes an upper bound on the Lyapunov exponent for a product of random matrices drawn from a time-dependent distribution.

This bound grows exponentially in , while the corresponding bound in Corollary 5 grows only polynomially in . The rate of convergence of the asynchronous variant is slower by a factor of because the provable contraction of the error in each iteration is now spread out among the nonzero coordinates of the current iterate.

###### Proof.

Proof of Theorem 6.2. Since the algorithm terminates when , it follows from Lemma 6 that . Recall that the algorithm chooses a coordinate in each iteration according to the distribution , as specified in (10). To simplify the analysis, we introduce another probability distribution , which has a fixed size support of rather than . We first analyze the convergence of a modified algorithm which samples coordinates according to . Then we translate the results back to the original algorithm.

Observe that for any , there exists a function , which satisfies the properties that for any and , if , then , and if , then . In words, is a function which takes a vector of sparsity at most , and maps it to a binary valued vector which preserves the sparsity pattern of , yet adds extra entries of 1 in order that the sparsity of the output is exactly . We define the distribution to choose uniformly at random among the nonzero coordinates of , according to:

 ~P(u)=eTuSt(r(t))min