# Exact Diffusion for Distributed Optimization and Learning — Part I: Algorithm Development

## Abstract

This work develops a distributed optimization strategy with guaranteed exact convergence for a broad class of left-stochastic combination policies. The resulting exact diffusion strategy is shown in Part II [2] to have a wider stability range and superior convergence performance than the EXTRA strategy. The exact diffusion solution is applicable to non-symmetric left-stochastic combination matrices, while many earlier developments on exact consensus implementations are limited to doubly-stochastic matrices; these latter matrices impose stringent constraints on the network topology. The derivation of the exact diffusion strategy in this work relies on reformulating the aggregate optimization problem as a penalized problem and resorting to a diagonally-weighted incremental construction. Detailed stability and convergence analyses are pursued in Part II [2] and are facilitated by examining the evolution of the error dynamics in a transformed domain. Numerical simulations illustrate the theoretical conclusions.

distributed optimization, diffusion, consensus, exact convergence, left-stochastic matrix, doubly-stochastic matrix, balanced policy, Perron vector.

## I Introduction and Motivation

This work deals with deterministic optimization problems where a collection of networked agents operate cooperatively to solve an aggregate optimization problem of the form:

(1) |

In this formulation, each risk function is convex and differentiable, while the aggregate cost is strongly-convex. Throughout the paper, we assume the network is undirected. All agents seek to determine the unique global minimizer, , under the constraint that agents can only communicate with their neighbors. This distributed approach is robust to failure of links and/or agents and scalable to the network size. Optimization problems of this type find applications in a wide range of areas including wireless sensor networks [3, 4, 5, 6], multi-vehicle and multi-robot control systems [7, 8], cyber-physical systems and smart grid implementations [9, 10, 11, 12], distributed adaptation and estimation [13, 14, 15, 16, 17], distributed statistical learning [18, 19, 20] and clustering [21, 22].

There are several classes of distributed algorithms that can be used to solve problem (1). In the primal domain, implementations that are based on gradient-descent methods are effective and easy to implement. There are at least two prominent variants under this class: the consensus strategy [23, 24, 25, 26, 27, 28, 29, 30] and the diffusion strategy [13, 14, 15, 16, 17]. There is a subtle but critical difference in the order in which computations are performed under these two strategies. In the consensus implementation, each agent runs a gradient-descent type iteration, albeit one where the starting point for the recursion and the point at which the gradient is approximated are not identical. This construction introduces an asymmetry into the update relation, which has some undesirable instability consequences (described, for example, in Secs. 7.2–7.3, Example 8.4, and also in Theorem 9.3 of [14] and Sec. V.B and Example 20 of [13]). The diffusion strategy, in comparison, employs a symmetric update where the starting point for the iteration and the point at which the gradient is approximated coincide. This property results in a wider stability range for diffusion strategies [13, 14]. Still, when sufficiently small step-sizes are employed to drive the optimization process, both types of strategies (consensus and diffusion) are able to converge exponentially fast, albeit only to an approximate solution [14, 27]. Specifically, it is proved in [14, 31, 27] that both the consensus and diffusion iterates under constant step-size learning converge towards a neighborhood of square-error size around the true optimizer, , i.e., as , where denotes the step-size and denotes the error at agent and iteration relative to . Since we are dealing with deterministic optimization problems, this small limiting bias is not due to any gradient noise arising from stochastic approximations; it is instead due to the inherent structure of the consensus and diffusion updates as clarified in the sequel.

Another important family of distributed algorithms are those based on the distributed alternating direction method of multipliers (ADMM) [32, 33, 34] and its variants [35, 36, 37]. These methods treat problem (1) in both the primal and dual domains. It is shown in [34] that distributed ADMM with constant parameters will converge exponentially fast to the exact global solution . However, distributed ADMM solutions are computationally more expensive since they necessitate the solution of optimal sub-problems at each iteration. Some useful variations of distributed ADMM [35, 36, 37] may alleviate the computational burden, but their recursions are still more difficult to implement than consensus or diffusion.

In more recent work [38], a modified implementation of consensus iterations, referred to as EXTRA, is proposed and shown to converge to the exact minimizer rather than to an neighborhood around . The modification has a similar computational burden as traditional consensus and is based on adding a step that combines two prior iterates to remove bias. Motivated by [38], other variations with similar properties were proposed in [39, 40, 41, 42, 43]. These variations rely instead on combining inexact gradient evaluations with a gradient tracking technique. The resulting algorithms, compared to EXTRA, have two information combinations per recursion, which doubles the amount of communication variables compared to EXTRA, and can become a burden when communication resources are limited.

The current work is motivated by the following considerations. The result in [38] shows that the EXTRA technique resolves the bias problem in consensus implementations. However, it is known that traditional diffusion strategies outperform traditional consensus strategies. Would it be possible then to correct the bias in the diffusion implementation and attain an algorithm that is superior to EXTRA (e.g., an implementation that is more stable than EXTRA)? This is one of the contributions in this two-part work; Parts I and II[2]. In this part, we shall indeed develop a bias-free diffusion strategy that will be shown in Part II[2] to have a wider stability range than EXTRA consensus implementations. Achieving these objectives is challenging for several reasons. First, we need to understand the origin of the bias in diffusion implementations. Compared to the consensus strategy, the source of this bias is different and still not well understood. In seeking an answer to this question, we will initially observe that the diffusion recursion can be framed as an incremental algorithm to solve a penalized version of (1) and not (1) directly — see expression (76) further ahead. In other words, the local diffusion estimate , held by agent at iteration , will be shown to approach the solution of a penalized problem rather than , which causes the bias.

### I-a Contributions

We have three main contributions in this article and the accompanying Part II [2] relating to: (a) developing a distributed algorithm that ensures exact convergence based on the diffusion strategy; (b) developing a strategy with wider stability range and enhanced performance than EXTRA consensus; and (c) developing a strategy with these properties for the larger class of local balanced (rather than only doubly-stochastic) matrices.

To begin with, we will show in this article how to modify the diffusion strategy such that it solves the real problem (1) directly. We shall refer to this variant as exact diffusion. Interestingly, the structure of exact diffusion will turn out to be very close to the structure of standard diffusion. The only difference is that there will be an extra “correction” step added between the usual “adaptation” and “combination” steps of diffusion — see the listing of Algorithm 1 further ahead. It will become clear that this adapt-correct-combine (ACC) structure of the exact diffusion algorithm is more symmetric in comparison to the EXTRA recursions. In addition, the computational cost of the “correction” step is trivial. Therefore, with essentially the same computational efficiency as standard diffusion, the exact diffusion algorithm will be able to converge exponentially fast to without any bias. Secondly, we will show in Part II[2] that exact diffusion has a wider stability range than EXTRA. In other words, there will exist a larger range of step-sizes that keeps exact diffusion stable but not the EXTRA algorithm. This is an important observation because larger values for help accelerate convergence.

Our third contribution is that we will derive the exact diffusion algorithm, and establish these desirable properties for the class of locally balanced combination matrices. This class does not only include symmetric doubly-stochastic matrices as special cases, but it also includes a range of widely-used left-stochastic policies as explained further ahead. First, we recall that left-stochastic matrices are defined as follows. Let denote the weight that is used to scale the data that flows from agent to . Let denote the matrix that collects all these coefficients. The entries on each column of are assumed to add up to one so that is left-stochastic, i.e., it holds that

(2) |

The matrix will not be required to be symmetric. For example, it may happen that . Using these coefficients, when an agent combines the iterates it receives from its neighbors, that combination will correspond to a calculation of the form:

(3) |

It should be emphasized that condition (2), which is repeated in (3), is different from all previous algorithms studied in [23, 32, 33, 34, 36, 38, 42, 43], which require to be symmetric and doubly stochastic (i.e., each of its columns and rows should add up to one). Although symmetric doubly-stochastic matrices are common in distributed optimization, policies of great practical value happen to be left-stochastic and not doubly-stochastic. For example, it is shown in Chapters 12 and 15 of [14] that the Hastings rule (see (18)) and the relative-degree rule (see (27)) achieve better mean-square-error (MSE) performance over adaptive networks than doubly-stochastic policies. Both of these rules are left-stochastic. Also, as we explain in Sec. VI-C, the averaging rule (see (22)) leads to faster convergence in highly unbalanced networks where the degrees of neighboring nodes differ drastically. This rule is again left-stochastic and is rather common in applications involving data analysis over social networks. Furthermore, the averaging rule has better privacy-preserving properties than doubly-stochastic policies since it can be constructed from information available solely at the agent. In contrast, the doubly-stochastic matrices generated, for example, by the maximum-degree rule or Metropolis rule [14] will require agents to share their degrees with neighbors.

We further remark that our proposed approach is different from existing algorithms that employ the useful push-sum technique, which requires to be right (rather than left) stochastic, i.e., is required to satisfy instead^{1}

(4) |

It will be illustrated in the simulations (later in Fig. 3 of Part II [2]) that the use of a left-stochastic combination policy and the adapt-then-combine structure in our approach lead to more efficient communications, and also to a stable performance over a wider range of step-sizes than right-stochastic policies used in the push-sum implementations [44, 45, 46, 47, 40]. However, the difference in the nature of the combination matrix (left vs. right-stochastic) complicates the convergence analysis and requires a completely different convergence analysis approach from [44, 45, 46, 47, 40].

In this Part I we derive the exact diffusion algorithm, while in Part II [2] we establish its convergence properties and prove its stability superiority over the EXTRA algorithm. This article is organized as follows. In Section II we review the standard diffusion algorithm, introduce locally-balanced left-stochastic combination policies, and establish several of their properties. In Section III we identify the source of bias in standard diffusion implementations. In Section IV we design the exact diffusion algorithm to correct for the bias. In Section V we illustrate the necessity of the locally-balanced condition on the combination policies by showing that divergence can occur if it is not satisfied. Numerical simulations are presented in Section VI.

Notation: Throughout the paper we use to denote a diagonal matrix consisting of diagonal entries , and use to denote a column vector formed by stacking . For symmetric matrices and , the notation or denotes is positive semi-definite. For a vector , the notation denotes that each element of is non-negative, while the notation denotes that each element of is positive. For a matrix , we let denote its range space, and denote its null space. The notation .

## Ii Diffusion and combination policies

### Ii-a Standard Diffusion Strategy

To proceed, we will consider a more general optimization problem than (1) by introducing a weighted aggregate cost of the form:

(5) |

for some positive coefficients . Problem (1) is a special case when the are uniform, i.e., , in which case . Note also that the aggregate cost is strongly-convex when is strongly-convex.

To solve problem (5) over a connected network of agents, we consider the standard diffusion strategy [15, 13, 14]:

(6) | ||||

(7) |

where are positive step-sizes, and the are nonnegative combination weights satisfying

(8) |

Moreover, denotes the set of neighbors of agent , and denotes the gradient vector of relative to . It follows from (8) that is a left-stochastic matrix. It is assumed that the network graph is connected, meaning that a path with nonzero combination weights can be found linking any pair of agents. It is further assumed that the graph is strongly-connected, which means that at least one diagonal entry of is non-zero [14] (this is a reasonable assumption since it simply requires that at least one agent in the network has some confidence level in its own data). In this case, the matrix will be primitive. This implies, in view of the Perron-Frobenius theorem [48, 14], that there exists an eigenvector satisfying

(9) |

We refer to as the Perron eigenvector of . Next, we introduce the vector

(10) |

where is the weight associated with in (5). Let the constant scalar be chosen such that

(11) |

Remark 1. (Scaling) Condition (11) is not restrictive and can be satisfied for any left-stochastic matrix through the choice of the parameter and the step-sizes. Note that should satisfy

(12) |

for all . To make the expression for independent of , we parameterize (select) the step-sizes as

(13) |

for some small . Then, , which is independent of , and relation (11) is satisfied.

Rermark 2. (Perron entries) Expression (13) suggests that agent needs to know the Perron entry in order to run the diffusion strategy (6)–(7). As we are going to see in the next section, the Perron entries are actually available beforehand and in closed-form for several well-known left-stochastic policies (see, e.g., expressions (19), (23), and (28) further ahead). For other left-stochastic policies for which closed-form expressions for the Perron entries may not be available, these can be determined iteratively by means of the power iteration — see, e.g., the explanation leading to future expression (39).

### Ii-B Combination Policy

Result (14) is a reassuring conclusion: it ensures that the squared-error is small whenever is small; moreover, the result holds for any left-stochastic matrix. Moving forward, we will focus on an important subclass of left-stochastic matrices, namely, those that satisfy a mild local balance condition (we shall refer to these matrices as balanced left-stochastic policies)[49]. The balancing condition turns out to have a useful physical interpretation and, in addition, it will be shown to be satisfied by several widely used left-stochastic combination policies. The local balance condition will help endow networks with crucial properties to ensure exact convergence to without any bias. In this way, we will be able to propose distributed optimization strategies with exact convergence guarantees for this class of left-stochastic matrices, while earlier exact convergence results are limited to (the less practical) right-stochastic or doubly-stochastic policies; these choices face implementation difficulties for the reasons explained before, which is the main motivation for focusing on left-stochastic policies in our treatment.

###### Definition 1 (Locally balanced Policies).

Let denote the Perron eigenvector of a primitive left-stochastic matrix , with entries . Let correspond to the diagonal matrix constructed from . The matrix is said to satisfy a local balance condition if it holds that

(15) |

or, equivalently, in matrix form:

(16) |

Relations of the form (15) are common in the context of Markov chains. They are used there to model an equilibrium scenario for the probability flux into the Markov states [50, 51], where the represent the transition probabilities from states to and the denote the steady-state distribution for the Markov chain.

We provide here an interpretation for (15) in the context of multi-agent networks by considering two generic agents, and , from an arbitrary network, as shown in Fig. 1. The coefficient is used by agent to scale information arriving from agent . Therefore, this coefficient reflects the amount of confidence that agent has in the information arriving from agent . Likewise, for . Since the combination policy is not necessarily symmetric, it will hold in general that . However, agent can re-scale the incoming weight by , and likewise for agent , so that the local balance condition (15) requires each pair of rescaled weights to match each other. We can interpret to represent the (fractional) amount of information flowing from to and to represent the price paid by agent for that information. Expression (15) is then requiring the information-cost benefit to be equitable across agents.

It is worth noting that the local balancing condition (15) is satisfied by several important left-stochastic policies, as illustrated in four examples below. Thus, let for agent . Then condition (11) becomes

(17) |

where .

Policy 1 (Hastings rule) The first policy we consider is the Hastings rule. Given and , we select as [14, 52]:

(18) |

where (the number of neighbors of agent ). It can be verified that is left-stochastic, and that the entries of its Perron eigenvector are given by

(19) |

Let

(20) |

With (18) and (19), it is easy to verify that

(21) |

If , it is obvious that (15) holds. If , then . In this case, .

Furthermore, we can also verify that when and are given, are generated through (18), and is chosen as in (20), then condition (11) is satisfied.

Policy 2 (Averaging rule) The second policy we consider is the popular average combination rule where is chosen as

(22) |

The entries of the Perron eigenvector are given by

(23) |

With (22) and (23), it clearly holds that

(24) |

which implies (15).

We can further verify that when is set as

(25) |

for some positive constant step-size and is set as

(26) |

then condition (11) will hold.

Policy 3 (Relative-degree rule) The third policy we consider is the relative-degree combination rule [53] where is chosen as

(27) |

and the entries of the Perron eigenvector are given by

(28) |

With (27) and (28), it clearly holds that

(29) |

which implies (15).

Policy 4 (Doubly stochastic policy) If matrix is primitive, symmetric, and doubly stochastic, its Perron eigenvector is . In this situation, the local balance condition (15) holds automatically.

Furthermore, if we assume each agent employs the step-size for some positive constant step-size , it can be verified that condition (11) holds with

(32) |

There are various rules to generate a primitive, symmetric and doubly stochastic matrix. Some common rules are the Laplacian rule, maximum-degree rule, Metropolis rule and other rules that listed in Table 14.1 in [14].

Policy 5 (Other locally-balanced policies) For other left-stochastic-policies for which closed-form expressions for the Perron entries need not be available, the Perron eigenvector can be learned iteratively to ensure that the step-sizes end up satisfying (13). Before we explain how this can be done, we remark that since the combination matrix is left-stochastic in our formulation, the power iteration employed in push-sum implementations cannot be applied since it works for right-stochastic policies. We proceed instead as follows.

Since is primitive and left-stochastic, it is shown in [54, 14] that

(33) |

This relation also implies

(34) |

Now let be the -th column of the identity matrix . Furthermore, let each agent keep an auxiliary variable with each initialized to . We also introduce

(35) | ||||

(36) |

By iterating according to

(37) |

we have

(38) |

Since , it can be verified that . Substituting into (II-B), we have In summary, it holds that

(39) |

where is the -th entry of the vector . Therefore, if we set

(40) |

then it follows that

(41) |

We finally note that the quantity that appears in the denominator of (40) can be guaranteed non-zero. This can be seen as follows. From the power iteration (37), we have

(42) |

Since for any and the combination matrix has non-negative entries, we conclude that for . In addition, focusing on the -th entry, we have

(43) |

where the inequality (*) holds because and . Since , if we let , i.e., each agent assigns positive weight to itself, we have

(44) |

In other words, the condition can guarantee the positiveness of . This condition is not restrictive because, for example, we can replace the power iteration (42) by

(45) |

where we are using the coefficients instead of . This is possible because the matrices and are both left-stochastic and have the same Perron vector . Note that no matter whether is zero or not.

We illustrate in Fig. 2 the relations among the classes of symmetric doubly-stochastic, balanced left-stochastic, and left-stochastic combination matrices. It is seen that every symmetric doubly-stochastic matrix is both left-stochastic and balanced. We indicated earlier that the EXTRA consensus algorithm was derived in [38] with exact convergence properties for symmetric doubly-stochastic matrices. Here, in the sequel, we shall derive an exact diffusion strategy with exact convergence guarantees for the larger class of balanced left-stochastic matrices (which is therefore also applicable to symmetric doubly-stochastic matrices). We will show in Part II[2] that the exact diffusion implementation has a wider stability range than EXTRA consensus; this is a useful property since larger step-sizes can be used to attain larger convergence rates.

Remark 3. (Convergence guarantees) One may wonder whether exact convergence can be guaranteed for the general left-stochastic matrices that are not necessarily balanced (i.e., whether the convergence property can be extended beyond the middle elliptical area in Fig. 2). It turns out that one can provide examples of combination matrices that are left-stochastic (but not necessarily balanced) for which exact convergence occurs and others for which exact convergence does not occur (see, e.g., the examples in Section V and Figs. 8 and 9). In other words, exact convergence is not always guaranteed beyond the balanced class. This conclusion is another useful contribution of this work; it shows that there is a boundary inside the set of left-stochastic matrices within which convergence can be always guaranteed (namely, the set of balanced matrices).

It is worth noting that the recent works [47, 46] extend the consensus-based EXTRA method to the case of directed networks by employing a push-sum technique. These extensions do not require the local balancing condition but they establish convergence only if the step-size parameter falls within an interval where and are two positive constants. However, it is not proved in these works whether this interval is feasible, i.e., whether . In fact, we will construct examples in Section V for which both exact diffusion and push-sum EXTRA will diverge for any step-size . In other words, both exact diffusion and EXTRA methods need not work well for directed networks. This is a disadvantage in comparison with DIGing-based methods [39, 40, 41, 42, 43].

In summary, when locally-balanced policies is employed, exact diffusion is more communication efficient and also more stable than other techniques including DIGing methods and EXTRA. However, just like EXTRA, the exact diffusion strategy is applicable to undirected (rather than directed) graphs.

### Ii-C Useful Properties

We now establish several useful properties for primitive left-stochastic matrices that satisfy the local balance condition (15). These properties will be used in the sequel.

###### Lemma 1 (Properties of ).

When satisfies the local balance condition (15), it holds that the matrix is primitive, symmetric, and doubly stochastic.

###### Proof.

With condition (15), the symmetry of is obvious. To check the primitiveness of , we need to verify two facts, namely, that: (a) at least one diagonal entry in is positive, and (b) there exists at least one path with nonzero weights between any two agents. It is easy to verify condition (a) because is already primitive and . For condition (b), since is connected and all diagonal entries of are positive, then if there exists a path with nonzero coefficients linking agents and under , the same path will continue to exist under . Moreover, since all diagonal entries of are positive, then the same path will also exist under . Finally, is doubly stochastic because

(46) |

(47) |

###### Lemma 2 (Nullspace of ).

When satisfies the local balance condition (15), it holds that is symmetric and positive semi-definite. Moreover, it holds that

(48) |

where denotes the null space of its matrix argument.

###### Proof.

Let denote the -th largest eigenvalue of . Recall from Lemma 1 that is primitive and doubly stochastic. Therefore, according to Lemma F.4 from [14] it holds that

(49) |

It follows that the eigenvalues of are non-positive so that .

Note further from (49) that the matrix has a single eigenvalue at one with multiplicity one. Moreover, from (47) we know that the vector is a right-eigenvector associated with this eigenvalue at one. Based on these two facts, we have

(50) |

for any constant . Relation (50) is equivalent to

(51) |

which confirms (48).

###### Corollary 1 (Nullspace of ).

Let and . When satisfies the local balance condition (15), it holds that

(52) |

Moreover, for any block vector in the nullspace of with entries , it holds that

(53) |

###### Proof.

Since has a single eigenvalue at with multiplicity one, we conclude that will have an eigenvalue at with multiplicity . Next we denote the columns of the identity matrix by where . We can verify that is a right-eigenvector associated with the eigenvalue because

(54) |

Now since any two vectors in the set are mutually independent, we conclude that

(55) |

These equalities establish (1). From (1) we can also conclude (53) because

(56) |

from some . The direction “” of (53) is obvious.

###### Lemma 3 (Real eigenvalues).

When satisfies the local balance condition (15), it holds that is diagonalizable with real eigenvalues in the interval , i.e.,

(57) |

where , and

(58) |

###### Proof.

According to the local balance condition (16), is symmetric. Using the fact that is diagonal, it holds that

(59) |

which shows that the matrix on the left-hand side is symmetric. Therefore, can be decomposed as

(60) |

where is an orthogonal matrix and is a real diagonal matrix. From (60), we further have that

(61) |

If we let , we reach the decomposition (57). Moreover, since is a primitive left-stochastic matrix, according to Lemma F.4 in [14], the eigenvalues of satisfy (58).

For ease of reference, we collect in Table I the properties established in Lemmas 1 through 3 for balanced primitive left-stochastic matrices .

Properties of balanced primitive left-stochastic matrices |

is diagonalizable with real eigenvalues in ; |

has a single eigenvalue at ; |

is symmetric, primitive, doubly-stochastic; |

is positive semi-definite; |

; |

. |

## Iii Penalized Formulation of Diffusion

In this section, we employ the properties derived in the previous section to reformulate the unconstrained optimization problem (5) into the equivalent constrained problem (74), which will be solved using a penalized formulation. This derivation will help clarify the origin of the bias from (14) in the standard diffusion implementation.

### Iii-a Constrained Problem Formulation

To begin with, note that the unconstrained problem (5) is equivalent to the following constrained problem:

(62) |

Now we introduce the block vector and

(63) |

With (53) and (63), problem (III-A) is equivalent to

(64) |

From Lemma 2, we know that is symmetric and positive semi-definite. Therefore, we can decompose

(65) |

where is a non-negative diagonal matrix and is an orthogonal matrix. If we introduce the symmetric square-root matrix

(66) |

then it holds that

(67) |

Let so that

(68) |

###### Lemma 4 (Nullspace of ).

With defined as in (66), it holds that

(69) |

###### Proof.

To prove , it is enough to prove

(70) |

Indeed, notice that

(71) |

The reverse direction “” in (70) is obvious.

### Iii-B Penalized Formulation

There are many techniques to solve constrained problems of the form (74). One useful and popular technique is to add a penalty term to the cost function and to consider instead a penalized problem of the form:

(75) |

where is a penalty parameter. Problem (75) is not equivalent to (74) but is a useful approximation. The smaller the value of is, the closer the solutions of problems (74) and (75) become to each other [55, 56, 57]. We now verify that the diffusion strategy (6)–(7) follows from applying an incremental technique to solving the approximate penalized problem (75), not the real problem (74). It will then become clear that the diffusion estimate cannot converge to the exact solution of problem (5) (or (74)).

Since (68) holds, problem (75) is equivalent to

(76) |

This is an unconstrained problem, which we can solve using, for example, a diagonally-weighted incremental algorithm, namely,

(77) |

The above recursion can be simplified as follows. Assume we select

(78) |

where is the same constant used in relation (11). Recall from (20), (26), (31) and (32) that and hence . Moreover, from the definition of in (63), we have