Excess-Risk of Distributed Stochastic Learners
Abstract
This work studies the learning ability of consensus and diffusion distributed learners from continuous streams of data arising from different but related statistical distributions. Four distinctive features for diffusion learners are revealed in relation to other decentralized schemes even under left-stochastic combination policies. First, closed-form expressions for the evolution of their excess-risk are derived for strongly-convex risk functions under a diminishing step-size rule. Second, using these results, it is shown that the diffusion strategy improves the asymptotic convergence rate of the excess-risk relative to non-cooperative schemes. Third, it is shown that when the in-network cooperation rules are designed optimally, the performance of the diffusion implementation can outperform that of naive centralized processing. Finally, the arguments further show that diffusion outperforms consensus strategies asymptotically, and that the asymptotic excess-risk expression is invariant to the particular network topology. The framework adopted in this work studies convergence in the stronger mean-square-error sense, rather than in distribution, and develops tools that enable a close examination of the differences between distributed strategies in terms of asymptotic behavior, as well as in terms of convergence rates.
distributed stochastic optimization, diffusion strategies, consensus strategies, centralized processing, excess-risk, asymptotic behavior, convergence rate, combination policy
I Introduction
Machine learning applications rely on the premise that it is possible to benefit from leveraging information collected from different users. The range of benefits, and the computational cost necessary to analyze the data, depend on how the information is mined. It is sometimes advantageous to aggregate the information from all users at a central location for processing and analysis. Many current implementations rely on this centralized approach. However, the rapid increase in the number of users, coupled with privacy and communication constraints related to transmitting, storing, and analyzing huge amounts of data at remote central locations, has served as strong motivation for the development of decentralized solutions to learning and data mining [3, 4, 5, 6, 7, 8, 9, 10, 11].
In this work, we study the distributed real-time prediction problem over a network of learners. We assume the network is connected, meaning that any two arbitrary agents are either connected directly or by means of a path passing through other agents. We do not expect the agents to share their data sets but only a parameter vector (or a statistic) that is representative of their local information. Such networks serve as useful models for peer-to-peer networks and social networks. The objective of the learning process is for all nodes to minimize some objective function, termed the risk function, in a distributed manner. We shall compare the performance of cooperative and non-cooperative solutions by examining the gap between the risk achieved by the distributed implementations and the risk achieved by an oracle solution with access to the true distribution of the input data; we shall refer to this gap as the excess-risk.
Among other contributions, this work studies stochastic gradient-based distributed strategies that are shown here to converge in the mean-square-error sense when a decaying step-size sequence is used, and that are also shown to outperform other implementations, even under left-stochastic combination rules [12, 13, 14, 15]. Specifically, we will show that the strategies under study achieve a better convergence rate than non-cooperative algorithms, and we will also explain why diffusion strategies outperform other distributed solutions such as those relying on consensus constructions or on doubly-stochastic combination policies [16, 4, 17, 18, 19], as well as naïve centralized algorithms [3]. It was previously shown that the diffusion strategies outperform their consensus-based counterparts in the constant step-size scenario [20]. We analytically show that the same conclusion holds in the diminishing step-size scenario even as the step-size decays. We also illustrate in the simulations that while diffusion and consensus-based algorithms have the same computational complexity, it turns out that diffusion algorithms reduce the overshoot during the transient phase. In comparison to the useful work [16], our formulation studies convergence in the stronger mean-square-error sense, and develops analysis tools that do not depend on using the central limit theorem or on studying convergence in a weaker distributional sense. In addition, unlike the works [21, 10, 3, 22], we are not solely interested in bounding the excess-risk. Instead, we are interested in obtaining a closed-form expression for the asymptotic excess-risk of distributed and non-distributed strategies in order to compare and optimize their absolute asymptotic performance.
Recently, there has also been interest in primal-dual approaches for distributed optimization [23, 24, 25, 6, 26]. Generally, these approaches are studied in the deterministic optimization context where the iterates are not prone to noise or when the risk function is non-differentiable. It was demonstrated in [23] that the primal diffusion strategy studied in this manuscript also outperforms augmented Lagrangian and Arrow-Hurwicz primal-dual algorithms in the stochastic constant-step-size setting in both stability and steady-state performance. It is possible to carry out similar comparisons in the diminishing step-size scenario, but this manuscript is focused on the study of primal approaches. As the extended analysis and derivation in later sections and appendices show, this case is already demanding enough to warrant separate consideration in this work.
The techniques developed will allow us to examine analytically and closely the differences between distributed strategies in terms of asymptotic behavior, as well as in terms of rates of convergence, by exploiting properties of Gamma functions and the convergence properties of products of infinitely many scaling coefficients. For instance, when the noise profile is uniform across all agents, one of our conclusions will be to show that the convergence rate of diffusion strategies is in the order of Θ(1/(N·i)), where N is the number of agents, i is the iteration index, and the notation a(i) = Θ(b(i)) means that the sequence a(i) decays at the same rate as b(i) for sufficiently large i, i.e., there exist positive constants c₁ and c₂ and an integer i₀ such that c₁·b(i) ≤ a(i) ≤ c₂·b(i) for all i ≥ i₀. This rate is consistent with the result established for consensus implementations under doubly-stochastic combination policies in [17, 19, 16], albeit under a weaker convergence-in-distribution sense, where it was argued that the estimation error approaches a Gaussian distribution whose covariance matrix scales as 1/(N·i). On the other hand, when the noise profile is non-uniform across the agents, the analysis will show that diffusion methods can surpass this rate. These and other useful conclusions will follow from the detailed mean-square and convergence analyses that are carried out in the sequel. The theoretical findings are illustrated by simulations in the last section.
For ease of reference, we summarize here the main conclusions in the manuscript:

We derive a closed-form expression (and not only a bound) for the asymptotic excess-risk curve of the distributed strategies.

We analyze the derived expression to conclude that the asymptotic performance depends on the network topology solely through the Perron vector of the combination matrix used in the strategy. In this way, different topology structures with the same Perron vector are shown to attain the same asymptotic performance. That is, the full eigenstructure of the topology becomes irrelevant in the asymptotic regime.

We show that once the Perron vector is optimized to minimize the asymptotic excess-risk, it is possible to construct a combination matrix with that Perron vector in order to attain the optimal performance in a fully distributed manner.

We compare the asymptotic excess-risk performance of the diffusion strategy to centralized and non-cooperative strategies to conclude that the diffusion strategy can attain the performance of a weighted centralized strategy asymptotically.

We compare the asymptotic excess-risk performance of the diffusion strategy to consensus distributed strategies to conclude that the asymptotic excess-risk curve of the consensus strategy will be worse than that of the diffusion strategy.

We verify our conclusions through simulations.
Notation: Random quantities are denoted in boldface. Throughout the manuscript, all vectors are column vectors. Matrices are denoted in capital letters, while vectors and scalars are denoted in lowercase letters. Network variables that aggregate variables across the network are denoted in calligraphic letters. Unless otherwise noted, the notation ‖·‖ refers to the Euclidean norm for a vector and to the matrix norm that is induced by the Euclidean norm for vectors. Furthermore, the notation ⊗ denotes the Kronecker product operation [27, p. 139]. The notation 𝟙_N denotes a vector of dimension N with all its elements equal to one.
II Problem Formulation and Algorithms
Consider a network of N learners. Each learner k is subject to a streaming sequence of independent data samples x_{k,i}, for i ≥ 1, arising from some fixed distribution X_k. The goal of each agent is to learn the vector w^o that optimizes the average of some loss function, say, E Q(w; x_k), where the expectation is over the distribution of the data and w is the vector variable of optimization. For example, in order to learn the hyperplane that best separates feature data h belonging to one of two classes γ ∈ {±1}, a regularized logistic-regression (RLR) algorithm would minimize the expected value of the following loss function over w (with the expectation computed over the distribution of the data x = {γ, h}) [22]:
Q(w; γ, h) = (ρ/2)‖w‖² + ln(1 + e^{−γhᵀw})    (1)
while a mean-square-error algorithm would minimize the expected value of the quadratic loss [28] (also referred to as the "delta rule" [29]):
Q(w; γ, h) = (γ − hᵀw)²    (2)
The expectation of the loss function over the distribution of the data is referred to as the risk function [30, p. 16]:
J_k(w) ≜ E Q(w; x_k)    (3)
and we denote the optimizer of (3) by w^o:
w^o ≜ arg min_w J_k(w)    (4)
where w^o is unique when J_k(w) is strongly-convex, which we shall assume for the remainder of the manuscript. The assumption of strong convexity of the risk function is important in practice since the convergence rate of most stochastic approximation strategies will be significantly reduced when the condition does not hold [22]. This is not a limitation in most problems arising in the context of adaptation and learning since regularization (such as the ℓ₂-regularization term in (1)) is often used and it helps ensure strong convexity. The risk function can be viewed as a measure of the "prediction-error" of a classifier or regression method since it evaluates the performance of the method against samples taken from the distribution of the input data that have not yet been observed by the classifier/regressor [30, p. 16]. The risk serves as a measure of how well an estimate will perform on a new sample on average. For this reason, the risk is also referred to as the generalization ability of the classifier.
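To make these quantities concrete, the short sketch below evaluates the two losses and a Monte Carlo estimate of the risk (3) in Python. The functional forms used for (1) and (2), namely (ρ/2)‖w‖² + ln(1 + e^{−γhᵀw}) and (γ − hᵀw)², as well as all variable names and the toy data model, are standard-form assumptions rather than exact reproductions of the displayed equations.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.1  # regularization strength (assumed value)

def rlr_loss(w, gamma, h):
    # assumed regularized logistic-regression loss: (rho/2)||w||^2 + ln(1 + exp(-gamma * h @ w))
    return 0.5 * rho * (w @ w) + np.log1p(np.exp(-gamma * (h @ w)))

def quadratic_loss(w, gamma, h):
    # assumed quadratic ("delta rule") loss: (gamma - h @ w)^2
    return (gamma - h @ w) ** 2

def risk(loss, w, samples):
    # Monte Carlo estimate of the risk J(w) = E Q(w; gamma, h)
    return np.mean([loss(w, g, h) for (g, h) in samples])

# toy data: binary labels gamma = +/-1 with Gaussian feature vectors h
samples = [(rng.choice([-1.0, 1.0]), rng.standard_normal(3)) for _ in range(1000)]
print(risk(rlr_loss, np.zeros(3), samples))  # ln(2) ~ 0.693 at w = 0, for any data set
```

At w = 0 the logistic term equals ln 2 regardless of the sample, which gives a quick sanity check on the risk estimator.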
We will assume for the remainder of this exposition that the optimizer w^o is the same for all nodes k. This case is common in both machine learning (where, for example, J_k(w) = J(w) for all k) and distributed inference applications where the distributions are dependent on a common parameter vector to be optimized (see Sec. V further ahead). In order to measure the performance of each learner, we define the excess-risk (ER) at node k as:
ER_k(i) ≜ E J_k(w_{k,i−1}) − J_k(w^o)    (5)
where w_{k,i−1} denotes the estimator of w^o that is computed by node k at time i−1 (i.e., it is the estimator that is generated by observing past data within the neighborhood of node k). The excess-risk is non-negative because J_k(w) is strongly-convex and, therefore, J_k(w) ≥ J_k(w^o) for all w. The expectation in (5) is over the data since w_{k,i−1} is a random quantity that depends on all the data samples up to time i−1. The dependence on the data from the other agents arises from the network topology. Our interest in this work is to characterize the convergence rate, to zero, of the excess-risk for various distributed and non-distributed strategies of learning for a given loss function. We also derive closed-form expressions for the asymptotic excess-risk and compare the absolute values of the excess-risk curves for algorithms that converge at the same rate.
There are various approaches for optimizing (4). We concentrate on fully-distributed strategies that operate over sparsely-connected networks. The concept of a fully-distributed strategy is used here to mean the following:

There is no central node that is coordinating the communication and computation during the learning process.

A node does not need to be connected to all other nodes. Indeed, as long as the network is connected (and it can be sparsely connected), the algorithm is able to approach the solution to the global learning problem.

Only one-hop communication is allowed during the learning process. That is, we do not allow the routing of data packets over the network. Instead, each agent/node is only allowed to communicate directly with its immediate neighbors.
Figure 1 illustrates the types of topologies examined in this manuscript. It is important to notice that the centralized and fully-connected topologies are theoretically equivalent, but practically different, as the centralized topology greatly reduces the amount of information that is communicated throughout the network. The centralized topology, however, is not robust to node failure since the entire solution breaks down if the central node fails. Throughout the remainder of the manuscript, we will refer to the centralized and fully-connected approaches interchangeably since they have identical excess-risk performance.
II-A Non-Cooperative Strategy
First, we examine the non-cooperative strategy for optimizing (4), in which each node k independently runs a stochastic gradient algorithm of the following form for i ≥ 1 [31, 32, 33, 34]:
w_{k,i} = w_{k,i−1} − μ(i) ∇_w Q(w_{k,i−1}; x_{k,i})    (6)
where ∇_w Q(·; ·) denotes the gradient vector of the loss function, and μ(i) is a step-size sequence. The gradient vector employed in (6) is an approximation for the actual gradient vector, ∇_w J_k(w), of the risk function. The difference between the true gradient vector and its approximation used in (6) is called gradient noise. Due to the presence of the gradient noise, the estimate generated by (6) becomes a random quantity; we use boldface letters to refer to random variables throughout our manuscript, which is already reflected in our notation in (6).
It is shown in [31, 35] that for strongly-convex risk functions, the non-cooperative scheme (6) achieves an asymptotic convergence rate in the order of O(1/i) under some conditions on the gradient noise and the step-size sequence μ(i), where the notation a(i) = O(b(i)) means that the sequence a(i) decays at a rate that is at most the rate of decay of b(i) for sufficiently large i, i.e., there exist a positive constant c and an integer i₀ such that a(i) ≤ c·b(i) for all i ≥ i₀. In this way, in order to achieve an excess-risk accuracy on the order of ε, the non-cooperative algorithm (6) would require O(1/ε) samples. It is further shown in [31, 36] that no algorithm can improve upon this rate under the same conditions. This implies that if no cooperation is to take place between the nodes, then the best asymptotic rate each learner could hope to achieve is on the order of 1/i.
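As a concrete illustration of recursion (6), the following sketch runs the non-cooperative stochastic-gradient learner on the quadratic loss with a decaying step-size μ(i) = μ/i. The linear data model, the noise level, and all variable names are toy assumptions introduced for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 2
w_true = np.array([1.0, -2.0])  # hypothetical common minimizer w^o (assumed)

def grad_quadratic(w, gamma, h):
    # gradient of the quadratic loss (gamma - h @ w)^2 with respect to w
    return -2.0 * (gamma - h @ w) * h

# non-cooperative stochastic-gradient recursion (6) with step-size mu/i
w = np.zeros(M)
mu = 1.0
for i in range(1, 10001):
    h = rng.standard_normal(M)                        # streaming feature vector
    gamma = h @ w_true + 0.1 * rng.standard_normal()  # noisy label (assumed model)
    w = w - (mu / i) * grad_quadratic(w, gamma, h)

print(np.linalg.norm(w - w_true))  # residual error; shrinks as the iterations accumulate
```

For this quadratic risk the Hessian is 2I, so μ = 1 satisfies the kind of condition μλ > 1 under which the O(1/i) rate is attained.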
II-B Centralized Strategy
Now, in place of the non-cooperative strategy, let us assume that the nodes transmit their samples to a central processor, which executes the following algorithm:
w_i = w_{i−1} − (μ(i)/N) Σ_{k=1}^{N} ∇_w Q(w_{i−1}; x_{k,i})    (7)
It can be shown that this implementation has an asymptotic convergence rate in the order of O(1/(N·i)) for step-size sequences of the form μ(i) = μ/i (see Corollary 2). In other words, the centralized implementation (7) provides an N-fold increase in convergence rate relative to the non-cooperative solution (6). One of the questions we wish to answer in this work is whether it is possible to derive a fully distributed algorithm that allows every node in the network to converge (in the mean-square-error sense) at the same rate as the centralized solution, i.e., O(1/(N·i)), with only communication between neighboring nodes and for general ad-hoc networks. We show that this task is indeed possible. We will additionally show that the distributed strategy can outperform the naïve centralized implementation (7) when the gradient noise profile across the agents is non-uniform, but that it will match the performance of a weighted version of (7), namely, the following weighted centralized strategy:
w_i = w_{i−1} − μ(i) Σ_{k=1}^{N} p_k ∇_w Q(w_{i−1}; x_{k,i})    (8)
where the weights {p_k} are convex combination coefficients that satisfy:
p_k ≥ 0,  Σ_{k=1}^{N} p_k = 1    (9)
and are meant to discount gradients with higher noise power relative to the others. We next describe two popular fully-distributed strategies.
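A minimal sketch of the weighted centralized recursion (8) follows, under the same kind of toy linear model as before. The inverse-variance choice of the weights p_k is an assumption used only to illustrate how noisier gradients can be discounted while still satisfying (9).

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 5, 2
w_true = np.array([0.5, 1.5])                # hypothetical common minimizer w^o
sigma = np.array([0.1, 0.2, 0.4, 0.8, 1.6])  # per-agent noise levels (assumed non-uniform)

# convex weights satisfying (9); the inverse-variance choice is an assumption
p = (1.0 / sigma**2) / np.sum(1.0 / sigma**2)

w = np.zeros(M)
mu = 1.0
for i in range(1, 10001):
    grad = np.zeros(M)
    for k in range(N):
        h = rng.standard_normal(M)
        gamma = h @ w_true + sigma[k] * rng.standard_normal()
        grad += p[k] * (-2.0 * (gamma - h @ w) * h)  # weighted sum of agents' gradients, as in (8)
    w = w - (mu / i) * grad

print(np.linalg.norm(w - w_true))
```

Because the weights sum to one, the expected update direction is unchanged; only the effective gradient-noise power is reduced relative to uniform averaging.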
II-C Diffusion Strategy
Following the approach of [14, 32, 34], the diffusion strategy for evaluating distributed estimates of w^o in (4) takes the following form:
[adaptation]  ψ_{k,i} = w_{k,i−1} − μ(i) ∇_w Q(w_{k,i−1}; x_{k,i})    (10)
[aggregation]  w_{k,i} = Σ_{ℓ∈N_k} a_{ℓk} ψ_{ℓ,i}    (11)
where x_{k,i} denotes the current data sample available at node k. Each node begins with an arbitrary estimate w_{k,0} and employs a diminishing positive step-size sequence μ(i). The non-negative coefficients {a_{ℓk}}, which form a left-stochastic combination matrix A, are used to scale information arriving at node k from its neighbors. These coefficients are chosen to satisfy:
a_{ℓk} ≥ 0,  Σ_{ℓ=1}^{N} a_{ℓk} = 1,  a_{ℓk} = 0 if ℓ ∉ N_k    (12)
We emphasize that we are only requiring A to be left-stochastic, meaning that only each of its columns should add up to one, rather than each of its columns and rows. The neighborhood N_k of each node k is defined as the set of nodes ℓ for which a_{ℓk} > 0. The neighborhood is typically known to agent k. The main difference between the above algorithm and the original adapt-then-combine (ATC) diffusion strategy studied in [14, 34, 32] is that we are employing a diminishing step-size sequence as opposed to a constant step-size. Constant step-sizes have the distinct advantage that they allow nodes to continue adapting their estimates in response to drifts in the underlying data distribution [37]. In this work, we are interested in examining the generalization ability of distributed learners asymptotically when the underlying distribution remains stationary, in which case the use of decaying step-size sequences is justified. If the statistical distribution of the data were subject to drifts, then constant step-sizes would become a necessity, and this scenario is already studied in some detail in [14, 15, 34, 32].
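The adapt-then-combine recursion (10)-(11) can be sketched in a few lines. The quadratic loss, the sparse left-stochastic combination matrix, the step-size μ(i) = 1/i, and the noise level below are all toy assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 4, 2
w_true = np.array([1.0, 1.0])  # hypothetical common minimizer w^o

# left-stochastic combination matrix A over a sparse topology: each COLUMN sums to one,
# and a_{lk} > 0 only if l is a neighbor of k (toy example)
A = np.array([
    [0.5, 0.3, 0.0, 0.3],
    [0.3, 0.5, 0.3, 0.0],
    [0.0, 0.2, 0.5, 0.2],
    [0.2, 0.0, 0.2, 0.5],
])
assert np.allclose(A.sum(axis=0), 1.0)

W = np.zeros((N, M))    # row k holds the iterate w_{k,i}
psi = np.zeros((N, M))  # row k holds the intermediate estimate psi_{k,i}
mu = 1.0
for i in range(1, 10001):
    for k in range(N):  # adaptation step (10): local stochastic-gradient update
        h = rng.standard_normal(M)
        gamma = h @ w_true + 0.2 * rng.standard_normal()
        psi[k] = W[k] - (mu / i) * (-2.0 * (gamma - h @ W[k]) * h)
    W = A.T @ psi       # aggregation step (11): w_{k,i} = sum over l of a_{lk} psi_{l,i}

print(np.linalg.norm(W - w_true, axis=1))  # per-node errors; all nodes approach w^o
```

Note that the aggregation step combines the already-updated intermediate estimates ψ_{ℓ,i}, which is the defining feature of the ATC order of operations.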
II-D Consensus Strategy
In addition to (10)–(11), we shall examine the following consensus-based implementation [4, 17, 18, 33] for solving the same problem (4):
w_{k,i} = Σ_{ℓ∈N_k} a_{ℓk} w_{ℓ,i−1} − μ(i) ∇_w Q(w_{k,i−1}; x_{k,i})    (13)
The diffusion and consensus strategies (10)–(13) have exactly the same computational complexity, except that the computations are performed in a different order. We will see in Sec. IV-D that this difference enhances the performance of diffusion over consensus. Moreover, in the constant step-size case, the difference in the order in which the operations are performed causes an anomaly in the behavior of consensus solutions in that they can become unstable even when all individual nodes are able to solve the inference task in a stable manner; see [20, 34, 32]. Furthermore, consensus strategies of the form (13) are usually limited to employing a doubly-stochastic combination matrix A. The analysis in the sequel will show that left-stochastic matrices actually lead to improved excess-risk performance (see Eqs. (58)–(60)), while convergence of the distributed implementation continues to be guaranteed (see Theorem 1).
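For contrast, here is a sketch of the consensus recursion (13) under the same kind of toy model. Both the combination and the gradient are evaluated at the previous iterates w_{ℓ,i−1}, whereas diffusion aggregates the already-updated intermediate estimates. The averaging matrix and data model below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 4, 2
w_true = np.array([1.0, 1.0])  # hypothetical common minimizer w^o
A = np.full((N, N), 1.0 / N)   # doubly-stochastic averaging matrix (toy assumption)

W = np.zeros((N, M))
mu = 1.0
for i in range(1, 10001):
    W_prev = W.copy()
    for k in range(N):
        h = rng.standard_normal(M)
        gamma = h @ w_true + 0.2 * rng.standard_normal()
        grad = -2.0 * (gamma - h @ W_prev[k]) * h
        # consensus (13): combine the PREVIOUS iterates of the neighbors, then apply the
        # local stochastic gradient, which is also evaluated at the previous local iterate
        W[k] = A[:, k] @ W_prev - (mu / i) * grad

print(np.linalg.norm(W - w_true, axis=1))
```

Comparing this loop with the diffusion sketch makes the "same complexity, different order" observation concrete: only the point at which the combination is applied differs.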
III Main Assumptions
Before proceeding with the analysis, we list in this section the main assumptions that are needed to facilitate the exposition. The conditions listed below are common in the broad stochastic optimization literature; see the explanations in [31, 33, 34, 32]. The first condition assumes that the risk functions J_k(w) are strongly-convex, with a common minimizer w^o for all k. This condition ensures that the optimization problem (4) is well-conditioned.
Assumption 1 (Properties of risk functions)
Each risk function J_k(w) is twice continuously-differentiable and its Hessian matrix ∇²_w J_k(w) is uniformly bounded from below and from above, namely,
(14) 
Furthermore, the risks at the various agents are minimized at the same location:
(15) 
and the Hessian matrices are locally Lipschitz continuous at w^o, i.e., for all w in a small neighborhood of w^o, there exists some constant τ ≥ 0 such that:
(16) 
We denote the value of the Hessian matrices at w^o (assumed uniform across the agents) by
H ≜ ∇²_w J_k(w^o)    (17)
We let λ denote the smallest eigenvalue of H.
In fact, conditions (14) and the locally Lipschitz condition (16) jointly imply that (16) holds globally (see Lemma E.8 in [32]); i.e., for all w, there exists some constant τ′ such that:
(18) 
Observe that
(19) 
One useful implication that follows from Assumption 1 is the following. Consider the expected excess-risk (5) at node k. Using the following sequence of inequalities, we can bound the excess-risk by a squared weighted norm:
(20a)  
(20b) 
where the expectation is over the distribution of the data, and the weighting matrix is defined as
(21) 
Step (a) is a consequence of applying the following mean-value theorem [31, p. 24] [32] twice for an arbitrary real-valued differentiable function f(·):
(22) 
and the fact that w^o optimizes J_k(w) so that ∇_w J_k(w^o) = 0. Step (b) is due to (14) and (21).
Expression (20a) shows that the expected excess-risk at node k is equal to a weighted mean-square-error with weight matrix (21). This means that one way to compute or bound the expected excess-risk is by evaluating weighted mean-square-error quantities of the form (20a) or (20b). This is the route that we will follow in this manuscript. We will analyze the right-hand side of (20a) in order to draw conclusions regarding the evolution of the expected excess-risk. In particular, once we establish that the distributed algorithm converges in the mean-square-error sense, then inequality (20b) will immediately allow us to conclude that the algorithm also converges in excess-risk. Similarly, we can obtain the asymptotic expression for the excess-risk by leveraging the weighted mean-square-error analysis developed for constant step-size distributed strategies [14, 15, 32], adjusted for the decaying step-size case. Observe that these conclusions are different from the useful results in [16], which focused on studying convergence in distribution. The mean-square-error results will enable us to expose analytically various interesting differences in the performance of distributed strategies, such as diffusion and consensus.
Our second condition is on the gradient noise process, which is defined, for a generic vector w, as
(23) 
We collect the noises from across the network into a column vector
(24) 
where we are introducing the vector notation
(25) 
for the collection of parameters across the agents. We denote the covariance matrix of the gradient noise vector by
(26) 
where the conditioning is in relation to the past history of the estimators. The following conditions are relaxations of assumptions that are regularly considered in the stochastic approximation literature; they are generally satisfied in important scenarios, such as logistic-regression or quadratic loss functions of the form (1)–(2); see [32].
Assumption 2 (Gradient noise model)
We assume the gradient noise process satisfies:
(27)  
(28)  
(29) 
for some , , as well as:
(30)  
(31) 
for some , , and where
(32) 
and (30) is assumed to hold for , for some small .
Observe that Assumption 2 implies that:
(33) 
for some and . In addition, the local Lipschitz condition (30) of order [38, p. 53] (sometimes referred to as the Hölder condition of order [39, p. 110]) implies, under Assumption 1, that the following global condition also holds [32, 15]:
(34) 
for some constant that depends on and where is from (30). Furthermore, due to (29), we have that the matrix is blockdiagonal:
(35)  
(36) 
Since nodes sample the data in an independent fashion, it is reasonable to expect the gradient noise to be uncorrelated across all nodes, as required by (29).
Our third condition relates to the structure of the network topology. We will assume that the network is strongly-connected, which means that (a) there exists at least one non-trivial self-loop, i.e., a_{kk} > 0 for some k, and (b) for any two agents k and ℓ, there exists a path with non-zero weights from k to ℓ, either directly if they are neighbors or through other agents. It is well-known that the combination matrix A for such networks is primitive [40, p. 516]. That is, all entries of A are non-negative and there exists some positive integer j such that all entries of A^j are strictly positive. One important property of primitive matrices follows from the Perron-Frobenius theorem [40, p. 534]: A will have a single eigenvalue at one, while all other eigenvalues of A will lie strictly inside the unit circle. Moreover, if we let p denote the right-eigenvector associated with the eigenvalue at one, and normalize its entries to add up to one, i.e.,
A p = p,  𝟙ᵀ p = 1    (37)
then all entries of p will be strictly positive. We shall refer to p as the Perron eigenvector of A. We formalize this assumption in the following:
Assumption 3 (Network Topology)
The network is strongly-connected so that the combination matrix A is primitive, with A p = p and 𝟙ᵀ p = 1, where p denotes the Perron eigenvector of A.
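The Perron eigenvector of a given primitive left-stochastic matrix can be computed by simple power iteration, since the eigenvalue at one is strictly dominant; the 3-node matrix below is a toy assumption.

```python
import numpy as np

# primitive left-stochastic combination matrix (columns sum to one; toy example)
A = np.array([
    [0.6, 0.2, 0.1],
    [0.3, 0.5, 0.3],
    [0.1, 0.3, 0.6],
])
assert np.allclose(A.sum(axis=0), 1.0)

# power iteration: repeated multiplication by A converges to the right-eigenvector
# associated with the dominant eigenvalue at one, i.e., the Perron eigenvector p.
# Since columns of A sum to one, the entries of the iterate keep summing to one.
p = np.ones(3) / 3
for _ in range(200):
    p = A @ p

print(p)                      # strictly positive entries that add up to one
assert np.allclose(A @ p, p)  # A p = p, as in (37)
assert np.all(p > 0)
```

The positivity of every entry of p is exactly the Perron-Frobenius property invoked in Assumption 3.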
IV Main Results
In this section, we list the main results and defer the detailed proofs to the appendices.
IV-A Convergence Properties
Our first result provides conditions on the step-size sequence under which the diffusion strategy converges both in the mean-square-error (MSE) sense and also almost surely. The difference between the two sets of conditions that appear below is that in one case the step-size sequence is additionally required to be square-summable.
Theorem 1 (Convergence rates)
Let Assumptions 1–3 hold and let the step-size sequence μ(i) satisfy
μ(i) > 0,  Σ_{i=1}^{∞} μ(i) = ∞,  lim_{i→∞} μ(i) = 0    (38)
Then, w_{k,i} generated by (10)–(11) converges to w^o in the MSE sense, i.e.,
lim_{i→∞} E‖w^o − w_{k,i}‖² = 0, for every node k    (39)
If the step-size sequence satisfies the additional square-summability condition:
Σ_{i=1}^{∞} μ²(i) < ∞    (40)
then w_{k,i} converges to w^o almost surely (i.e., with probability one) for all k. Furthermore, when the step-size sequence is of the form μ(i) = μ/i with μ satisfying μλ > 1, then the second- and fourth-order moments of the error vector converge at the rates of 1/i and 1/i², respectively:
E‖w^o − w_{k,i}‖² = O(1/i)    (41)
E‖w^o − w_{k,i}‖⁴ = O(1/i²)    (42)
where was introduced in (33).
See Appendix A. Observe that (41) implies that each node converges in the mean-square-error sense at the rate O(1/i). Combining this result with (20b), we conclude that each node also converges in excess-risk at this rate:
ER_k(i) = O(1/i)    (43)
when μλ > 1.
Note that this conclusion does not yet reveal the benefit of cooperation (for example, it does not show how the convergence rate depends on the number of agents N). In the next section, we will derive closed-form asymptotic expressions for the mean-square-error and the excess-risk, and from these expressions we will be able to highlight the benefit of network cooperation.
IV-B Evolution of the Excess-Risk Measure
We continue to assume that the step-size sequence is selected as μ(i) = μ/i for some μ > 0. This sequence satisfies conditions (38) and (40). Observe that in order to evaluate the excess-risk at node k, we must evaluate (20a). To do so, we first form the following network-wide error quantity:
(44) 
and let E_k denote the matrix with a single entry equal to one at the k-th diagonal location and all other entries equal to zero. Then, using (20a), we can write:
(45) 
In order to facilitate the analysis, we introduce the eigenvalue decomposition of the matrix H:
(46) 
where U is an orthogonal matrix and Λ is diagonal with positive entries. Moreover, since the matrix A is left-stochastic and primitive (by Assumption 3), we can express its Jordan decomposition in the form:
(47) 
where the remaining factors collect the left- and right-eigenvectors of A associated with the eigenvalues inside the unit disc, together with the corresponding Jordan structure.
Theorem 2 (Asymptotic Convergence of the Excess-Risk)
Let Assumptions 1–3 hold and let λ denote the smallest eigenvalue of the matrix H:
(48) 
Then, when μλ > 1, it holds asymptotically that
(49)  
(50) 
where the notation indicates that the ratio of the two sides converges to one as i → ∞. Moreover, λ_m is the m-th eigenvalue of Λ, and the matrix in (51) is defined as:
(51) 
where the notation [·]_{mm} denotes the m-th diagonal element of its matrix argument.
See Appendix B. Theorem 2 establishes a closed-form expression for the asymptotic excess-risk of the diffusion algorithm. We observe that the slowest rate at which the asymptotic term converges depends on the smallest eigenvalue λ of H and on the constant μ. Interestingly, the only dependence on the topology of the network asymptotically is encoded in the Perron vector p of the combination matrix A; i.e., most of the eigenstructure of the topology matrices becomes irrelevant asymptotically and only influences the convergence rate in the transient regime. We will see further ahead that the Perron vector can be optimized to minimize the excess-risk in the asymptotic regime. It is natural that the transient stage should depend on the network geometry because the networked agents are propagating their information over the entire network. The speed of information propagation over a sparsely connected network is determined by the second largest eigenvalue of the combination matrix [14], which is influenced by the degree of network connectivity. Our results show, however, that there is an asymptotic regime where the performance of the diffusion strategy can be made invariant to the specific network topology since the Perron vector can be designed for general connected networks, as we will see further ahead in (60). Finally, we observe that all agents participating in the network will achieve the same asymptotic performance given by the right-hand side of (49), as this asymptotic expression for the excess-risk does not depend on any particular node index but only on the network-wide quantity p.
When λ is not known, and thus it is not clear how to choose μ to satisfy μλ > 1, it is common to choose a large μ that forces μλ ≫ 1. In this case, we obtain from (50) that
(52) 
This approximation is close in form to the original steady-state performance expression derived for the diffusion algorithm when a constant step-size is used [41]. The main difference is that the "steady-state" term will now diminish at the rate 1/i as i grows.
By specializing the previous results to the case N = 1 (a stand-alone node), we obtain as a corollary the following result for the expected excess-risk that is delivered by the non-cooperative stochastic gradient algorithm (6).
Corollary 1 (Stochastic gradient approximation)
Observe that (53)–(54) are stronger than those in Theorem 1 since we are not only stating that the convergence rate is O(1/i) but we are also giving the exact constant that multiplies the 1/i factor. In the next section, we examine the relationship between the derived constant and the network size and noise parameters across the network. Following this presentation, we will utilize our mean-square-error expressions to examine the differences between the diffusion strategy (10)–(11) and the consensus strategy (13).
IV-C Benefit of Cooperation
Table I: Fully-distributed combination rules (Average Rule, Metropolis Rule [42], Hastings Rule [43]) and their corresponding Perron vectors.
Up to this point in the discussion, the benefit of cooperation has not yet manifested itself explicitly; this benefit is actually already encoded in the Perron vector p. Optimization over p will help bring forth these advantages. Thus, observe that the expression for the asymptotic term in (49) is quadratic in p. We can optimize the asymptotic expression over p in order to reduce the excess-risk. We rewrite the asymptotic excess-risk (49) as:
(56) 
where
(57) 
Then, we consider the problem of optimizing (56) over the combination matrix A:
where 𝔸 denotes the set of left-stochastic and primitive combination matrices that satisfy the network topology structure. It is generally not clear how to solve this optimization problem over both p and A. We pursue an indirect route. We first remove the optimization over A and determine an optimal p. Subsequently, given the optimal p, we show that a left-stochastic and primitive matrix in 𝔸 with this Perron vector can be constructed. The relaxed problem is:
(58) 
whose solution is
(59) 
It is straightforward to verify that the entries of p^o are positive and add up to one. A combination matrix that has p^o as its Perron eigenvector is generated by the Hastings rule [44, 41] [32, Lemma 12.2], which is given by
(60) 
It is possible to see that for agent k to implement the Hastings rule, it needs to know its neighborhood N_k (which is known to agent k), as well as the number of neighbors that each of its neighbors has (this information is easily obtained from the immediate neighbors), and the Perron vector that the network wishes to attain. Therefore, the design of the weighting matrix A can be done in a fully-distributed manner. Table I lists three fully-distributed combination rules (combination rules that can be implemented in a fully-distributed manner) and their corresponding Perron vectors.
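A standard way to realize a combination matrix with a prescribed Perron vector p is the classical Metropolis-Hastings construction; the sketch below uses that classical form, which is an assumption and may differ in detail from the expression in (60). The helper function, the target vector, and the ring topology (with symmetric neighborhoods including self-loops) are all hypothetical.

```python
import numpy as np

def hastings_matrix(p, neighbors):
    # Metropolis-Hastings construction (assumed form; hypothetical helper) of a
    # left-stochastic A with Perron vector p. For a neighbor l != k:
    #   a_{lk} = min(p_k / n_k, p_l / n_l) / p_k,
    # and a_{kk} = 1 - (sum of the other entries of column k), with n_k = |N_k|.
    # Detailed balance p_k * a_{lk} = p_l * a_{kl} then yields A p = p.
    N = len(p)
    n = {k: len(neighbors[k]) for k in range(N)}  # neighborhood sizes (self included)
    A = np.zeros((N, N))
    for k in range(N):
        for l in neighbors[k]:
            if l != k:
                A[l, k] = min(p[k] / n[k], p[l] / n[l]) / p[k]
        A[k, k] = 1.0 - A[:, k].sum()
    return A

# target Perron vector over a 4-node ring with self-loops (toy assumption)
p = np.array([0.4, 0.3, 0.2, 0.1])
neighbors = {0: [0, 1, 3], 1: [0, 1, 2], 2: [1, 2, 3], 3: [0, 2, 3]}
A = hastings_matrix(p, neighbors)

print(np.allclose(A.sum(axis=0), 1.0), np.allclose(A @ p, p))  # True True
```

Each column k uses only quantities available in the neighborhood of agent k, which mirrors the fully-distributed nature of the Hastings design discussed above.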
To see the effectiveness of this choice for , we consider the case where , so that
(61) 
Substituting (59) into (56), we obtain:
(62)  
(63) 
where step (a) is due to (61). First, we will compare this performance with that of the centralized algorithm (7). To do this, we first establish the following result:
Corollary 2 (Unweighted Centralized Processing)
The centralized algorithm (7) is a special case of the diffusion algorithm when A = 𝟙𝟙ᵀ/N, in which case p = 𝟙/N, which yields a network that satisfies Assumption 3. To see this, consider the diffusion algorithm (10)–(11) with a_{ℓk} = 1/N:
(66)  
(67) 
First, observe that after the first iteration, the estimates across the network are uniform since (67) does not depend on k. We can therefore drop the subscript k from w_{k,i}:
(68)  
(69) 
Substituting (68) into (69), we obtain:
(70) 
where step (a) is due to the uniformity of the iterates and step (b) is obtained by substituting a_{ℓk} = 1/N. Observe that (70) is identical to (7), the unweighted centralized algorithm. Then, using the analysis of the diffusion algorithm in Theorem 2, we have that
(71) 
Since the right-hand side of (71) does not depend on the agent index (all agents will achieve the same asymptotic performance), we have that the average excess-risk remains the same:
(72) 
where step (a) is due to (61), which is the desired result.
Comparing (63) to (54) in the special case when the gradient-noise profile is uniform across all agents, we find that the diffusion algorithm offers an N-fold improvement in the excess-risk over the non-cooperative solution. Also, comparing (63) to (65) in this case, we observe that asymptotically the diffusion algorithm achieves the same performance as the centralized algorithm (7). More generally, let us consider the case in which the noise profile is not uniform across the agents. We call upon the following inequality:
(73) 
which follows from the fact that the harmonic mean of a set of numbers is upper-bounded by their arithmetic mean. Then, we conclude from (73) that the excess-risk of the diffusion strategy is upper-bounded by that of the centralized strategy (7), and that equality holds only when the network experiences a spatially uniform gradient-noise profile. This implies that the diffusion strategy actually outperforms the implementation studied in [16], which uses a doubly-stochastic combination matrix. Furthermore, in this case of non-uniform noise profile and since