Efficient Information Aggregation Strategies for Distributed Control and Signal Processing
###### Abstract

This thesis will be concerned with distributed control and coordination of networks consisting of multiple, potentially mobile, agents. This is motivated mainly by the emergence of large scale networks characterized by the lack of centralized access to information and time-varying connectivity. Control and optimization algorithms deployed in such networks should be completely distributed, relying only on local observations and information, and robust against unexpected changes in topology such as link failures.

We will describe protocols to solve certain control and signal processing problems in this setting. We will demonstrate that a key challenge for such systems is the problem of computing averages in a decentralized way. Namely, we will show that a number of distributed control and signal processing problems can be solved straightforwardly if solutions to the averaging problem are available.

The rest of the thesis will be concerned with algorithms for the averaging problem and its generalizations. We will (i) derive the fastest known averaging algorithms in a variety of settings and subject to a variety of communication and storage constraints; (ii) prove a lower bound identifying a fundamental barrier for averaging algorithms; and (iii) propose a new model for distributed function computation which reflects the constraints facing many large-scale networks, and nearly characterize the general class of functions which can be computed in this model.

Efficient Information Aggregation Strategies for Distributed Control and Signal Processing

by

Alexander Olshevsky

B.S., Mathematics, Georgia Institute of Technology (2004)
B.S., Electrical Engineering, Georgia Institute of Technology (2004)
M.S., Electrical Engineering and Computer Science, Massachusetts Institute of Technology (2006)

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2010

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part.

Author
Department of Electrical Engineering and Computer Science
August 30, 2010

Certified by
John N. Tsitsiklis
Clarence J Lebel Professor of Electrical Engineering
Thesis Supervisor

Accepted by
Terry P. Orlando
Professor of Electrical Engineering and Computer Science,


### Acknowledgments

I am deeply indebted to my advisor, John Tsitsiklis, for his invaluable guidance and tireless efforts in supervising this thesis. I have learned a great deal from our research collaborations. I have greatly benefitted from his constantly insightful suggestions and his subtle understanding of when a technique is powerful enough to solve a problem and when it is bound to fail.

I would like to thank my thesis committee members Emilio Frazzoli and Asuman Ozdaglar for their careful work in reading this thesis, which was considerably improved as a result of their comments.

I would like to thank the many people with whom I’ve had conversations which have helped me in the course of my research: Amir Ali Ahmadi, Pierre-Alexandre Bliman, Vincent Blondel, Leonid Gurvits, Julien Hendrickx, Ali Jadbabaie, Angelia Nedic, and Pablo Parrilo.

I would like to thank my family for the help they have given me over the years. Most importantly, I want to thank my wife Angela for her constant encouragement and support.

This research was sponsored by the National Science Foundation under an NSF Graduate Research Fellowship and grant ECCS-0701623.

## Chapter 1 Introduction

This thesis is about certain control and signal processing problems over networks with unreliable communication links. Some motivating scenarios are:

1. Distributed estimation: a collection of sensors are trying to estimate an unknown parameter from observations at each sensor.

2. Distributed state estimation: a collection of sensors are trying to estimate the (constantly evolving) state of a linear dynamical system from observations of its output at each sensor.

3. Coverage control: A group of robots wish to position themselves so as to optimally monitor an environment of interest.

4. Formation control: Several UAVs or vehicles are attempting to maintain a formation against random disturbances to their positions.

5. Distributed task assignment: allocate a collection of tasks among agents with individual preferences in a distributed way.

6. Clock synchronization: a collection of clocks are constantly drifting apart, and would like to maintain a common time to the extent possible. Various pairs of clocks can measure noisy versions of the time offsets between them.

We will use “nodes” as a common word for sensors, vehicles, UAVs, and so on. A key assumption we will be making is that the communication links by means of which the nodes exchange messages are unreliable. A variety of simple and standard techniques can be used for any of the above problems if the communication links never fail. By contrast, we will be interested in the case when the links may “die” and “come online” unpredictably. We are interested in algorithms for the above problems which work even in the face of this uncertainty.

It turns out that a key problem for systems of this type is the problem of computing averages, described next. Nodes $1, \ldots, n$ each begin with a real number $x_i$. We are given a discrete sequence of times $t = 1, 2, 3, \ldots$, and at each time step a communication graph $G(t) = (\{1, \ldots, n\}, E(t))$ is exogenously provided by “nature,” determining which nodes can communicate: node $i$ can send messages to node $j$ at time $t$ if and only if $(i,j) \in E(t)$. For simplicity, no restrictions are placed on the messages nodes can send to each other, and in particular, the nodes may broadcast their initial values to each other. The nodes need to compute $(1/n) \sum_{i=1}^n x_i$, subject to as few assumptions as possible about the communication sequence $G(t)$. We will call this the averaging problem, and we will call any algorithm for it an averaging algorithm.

The averaging problem is key in the sense that a variety of results are available describing how to use averaging algorithms to solve many other distributed problems with unreliable communication links. In particular, for each of the above problems (distributed estimation, distributed state estimation, coverage control, formation control, clock synchronization), averaging algorithms are a cornerstone of the best currently known solutions.

The remainder of this chapter will begin by giving a historical survey of averaging algorithms, followed by a list of the applications of averaging, including the problems listed above.

### 1.1 A history of averaging algorithms

The first paper to introduce distributed averaging algorithms was by DeGroot [39]. DeGroot considered a simple model of “attaining agreement”: $n$ individuals are on a team or committee and would like to come to a consensus about the probability distribution of a certain parameter $\theta$. Each individual $i$ begins with a probability distribution $F_i$ which they believe is the correct distribution of $\theta$. For simplicity, we assume that $\theta$ takes on a value in some finite set $\Omega$, so that each $F_i$ can be described by $|\Omega|$ numbers.

The individuals now update their probability distributions as a result of interacting with each other. Letting $F_i(t)$ be the distribution believed by $i$ at time $t$, the agents update as $F_i(t+1) = \sum_j a_{ij} F_j(t)$, with the initialization $F_i(0) = F_i$. Here, $a_{ij}$ are weights chosen by the individuals. Intuitively, people may give high weights to a subset of the people they interact with, for example if a certain person is believed to be an expert in the subject. On the other hand, some weights may be zero, which corresponds to the possibility that some individuals ignore each other’s opinions. It is assumed, however, that all the weights are nonnegative, and add up to $1$ for every $i$. Note that the coefficients $a_{ij}$ are independent of time, corresponding to a “static” communication pattern: individuals do not change how much they trust the opinions of others.

DeGroot gave a condition for these dynamics to converge, as well as a formula for the limiting opinion; subject to some natural symmetry conditions, the limiting opinion distribution will equal the average of the initial opinion distributions $F_i$. DeGroot’s work was later extended by Chatterjee and Seneta [33] to the case where the weights $a_{ij}(t)$ vary with time. The paper [33] gave some conditions on the time-varying sequence $a_{ij}(t)$ required for convergence to agreement on a single distribution among the individuals.
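The weighted-averaging dynamics described above can be illustrated with a minimal sketch for scalar opinions rather than full distributions. The function name `degroot_step`, the weight matrix `A`, and the initial opinions are hypothetical choices made for illustration and are not taken from [39]. The matrix is symmetric with rows summing to one, which is one simple instance of a symmetry condition under which the limiting opinion is the average.

```python
def degroot_step(weights, opinions):
    """One round: each individual replaces its opinion by a weighted average."""
    return [sum(w * x for w, x in zip(row, opinions)) for row in weights]

# Hypothetical 3-person committee; every row is nonnegative and sums to 1.
# The matrix is symmetric, so its columns also sum to 1.
A = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
]
initial = [0.0, 3.0, 9.0]

opinions = initial
for _ in range(50):
    opinions = degroot_step(A, opinions)

average = sum(initial) / len(initial)      # target value of the consensus
spread = max(opinions) - min(opinions)     # disagreement left after 50 rounds
```

With these weights the disagreement shrinks by a constant factor in every round, so after a few dozen rounds all three opinions are numerically indistinguishable from the average of the initial values.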

The same problem of finding conditions on $a_{ij}(t)$ necessary for agreement was addressed in the works [94, 95, 17], which were motivated by problems in parallel computation. Here, the problem was phrased slightly differently: $n$ processors each begin with a number $x_i$ stored in memory, and the processors need to agree on a single number within the convex hull $[\min_i x_i, \max_i x_i]$. This is accomplished by iterating as $x_i(t+1) = \sum_j a_{ij}(t) x_j(t)$. This problem was a subroutine of several parallel optimization algorithms [94]. It is easy to see that it is equivalent to the formulation in terms of probability distributions addressed by DeGroot [39] and Chatterjee and Seneta [33].

The works [94, 95, 17] gave some conditions necessary for the estimates $x_i(t)$ to converge to a common value. These were in the same spirit as [33], but were more combinatorial than those of [33], being expressed directly in terms of the coefficients $a_{ij}(t)$. These conditions boiled down to a series of requirements suggesting that the agents have repeated and nonvanishing influence over each other. For example, the coefficients $a_{ij}(t)$ should not be allowed to decay to zero, and the graph sequence $G(t)$ containing the edges $(j,i)$ for which $a_{ij}(t) > 0$ needs to be “repeatedly connected.”

Several years later, a similar problem was studied by Cybenko [38], motivated by load balancing problems. In this context, $n$ processors each begin with a certain number of jobs $x_i$. The variable $x_i$ can only be an integer, but assuming a large number of jobs in the system, this restriction may be dispensed with. The processors would like to equalize the load. To that end, they pass around jobs: processors with many jobs try to offload their jobs on their neighbors, and processors with few jobs ask their neighbors for more. The number of jobs of processor $i$ behaves approximately as $x_i(t+1) = x_i(t) + \sum_j a_{ij} (x_j(t) - x_i(t))$, which, subject to some conditions on the coefficients $a_{ij}$, may be viewed as a special case of the iterations considered in [33, 94, 95, 17].

Cybenko showed that when the neighborhood structure is a hypercube (i.e., we associate with each processor a string of $\log_2 n$ bits, and processors $i$ and $j$ are neighbors whenever their strings differ in at most one bit), the above processes may converge quite fast: for some natural simple processes, an appropriately defined convergence time is on the order of $\log n$ iterations.

Several years later, a variation on the above algorithms was studied by Vicsek et al. [97]. Vicsek et al. simulated the following scenario: particles were placed randomly on a torus with random initial direction and constant velocity. Periodically, each particle would try to align its angle with the angles of all the particles within a certain radius. Vicsek et al. reported that the end result was that the particles aligned on a single direction.

The paper [55] provided a theoretical justification of the results in [97] by proving the convergence of a linearized version of the update model of [97]. The results of [55] are very similar to the results in [94, 95], modulo a number of minor modifications. The main difference appears to be that [94, 95] make certain assumptions on the sequence $G(t)$ that are not made in [55]; these assumptions, however, are never actually used in the proofs of [94, 95]. We refer the reader to [16] for a discussion.

The paper [55] has created an explosion of interest in averaging algorithms, and the subsequent literature expanded in a number of directions. It is impossible to give a complete account of the literature since [55] in a reasonable amount of space, so we give only a brief overview of several research directions.

Convergence in some natural geometric settings. One interesting direction of research has been to analyze the convergence of some plausible geometric processes, for which there is no guarantee that any of the sufficient conditions for consensus (e.g. from [95] or [55]) hold. Notable in this direction was [37], which proved convergence of one such process featuring all-to-all communication with decaying strength. In a different direction, [30, 29] give tight bounds on the convergence of averaging dynamics when the communication graph corresponds to nearest neighbors of points in .

General conditions for averaging. A vast generalization of the consensus conditions in [95] and [55] was given in [76]. Using set-valued generalizations of the Lyapunov functions in [95, 55], it was shown that a large class of possible nonlinear maps lead to consensus. Further investigation of these results was given in [5] and [73].

Quantized consensus. The above models assume that nodes can transmit real numbers to each other. It is natural to consider quantized versions of the above schemes. One may then ask about the tradeoffs between storage and performance. A number of papers explored various aspects of these tradeoffs. In [57], a simple randomized scheme for achieving approximate averaging was proposed. Further research along the same lines can be found in [106] and [43]. A dynamic scheme which allows us to approximately compute the average as the nodes communicate more and more bits with each other can be found in [24].

We will also consider this issue in this thesis, namely in Chapters 7 and 8, which are based on the papers [79] and [53], respectively. In Chapter 7, we will give a recipe for quantizing any linear averaging scheme to compute the average approximately. In Chapter 8, we will consider the problem of computing the average approximately with a deterministic algorithm, when each node can store only a constant number of bits for each link it maintains.

Averaging with coordinates on geometric random graphs. Geometric random graphs are common models for sensor networks. It is therefore of interest to try to specialize results for averaging to the case of geometric random graphs. Under the assumption that every node knows its own exact coordinates, an averaging algorithm with a lower than expected averaging cost was developed in [40]. Further research in [13, 96] reduced the energy cost even further. Substantial progress towards removing the assumption that each node knows its coordinates was recently made in [83].

Design of fast averaging algorithms on fixed graphs. It is interesting to consider the fastest averaging algorithm for a given, fixed graph. It is hoped that an answer would give some insight into averaging algorithms: what their optimal speed is, how it relates to graph structure, and so on. In [100], the authors showed how to compute optimal symmetric linear averaging algorithms with semidefinite programming. Some further results for specific graphs were given in [21, 22]. Optimization over a larger class of averaging algorithms was considered in [88] and also in [61].

Analysis of social networks. We have already mentioned the work of DeGroot [39], which was aimed at modeling the interactions of individuals via consensus-like updates. A number of recent works have taken this line of analysis further by analyzing how the combinatorial structure of social networks affects the outcome. In particular, [50] studied how good social networks are at aggregating distributed information in terms of various graph-cut related quantities. The recent work [4] quantified the extent to which “forceful” agents which are not influenced by others interfere with information aggregation.

### 1.2 Applications of averaging

We give an incomplete list of the use of averaging algorithms in various applications.

1. Consider the following distributed estimation problem: a sensor network would like to estimate some unknown vector of parameters. At a discrete set of times, some sensors make noise-corrupted measurements of a linear function of the unknown vector. The sensors would like to combine the measurements that are coming in into a maximum likelihood estimate.

We discuss a simpler version of this problem in the following chapter, and describe some averaging-based algorithms. In brief, other known techniques (flooding, fusion along a spanning tree) suffer from either high storage requirements or lack of robustness to link failures. The use of averaging-based algorithms allows us to avoid these pitfalls, as we will explain in Chapter 2. For some literature on this subject, we refer the reader to [101, 102] and [27].

2. Distributed state estimation: a collection of sensors are trying to estimate the (constantly evolving) state of a linear dynamical system. Each sensor is able to periodically make a noise corrupted measurement of the system output. The sensors would like to cooperate on synthesizing a Kalman filter estimate.

There are a variety of challenges involved in such a problem, not least of which is the delay involved in receiving at one node the measurements from other nodes which are many hops away. Several ideas have been presented in the literature for solving this sort of problem based on averaging algorithms. We refer the reader to [84, 8, 85, 26, 42].

3. Coverage control is the problem of optimally positioning a set of robots to monitor an area. A typical case involves a polygon-shaped area along with robots which can measure distances to the boundary as well as to each other. Based on these distances, it is desirable to construct controllers which cover the entire area, yet assign as little area as possible to each robot. A common addition to this setup involves associating a number to each point in the polygon, representing the importance of monitoring this point. The robots then optimize a corresponding objective function which weights regions according to the importance of the points in them.

Averaging algorithms have proven very useful in designing distributed controllers for such systems. We refer the reader to [46] for the connection between distributed controllers for these systems and averaging over a certain class of graphs defined by Voronoi diagrams. A similar approach was adopted in [92]. Note, also, the related paper [71] and the extension to nonuniform coverage in [72].

4. Formation control is the problem of maintaining a set formation, defined by a collection of relative distances, against random or adversarial disturbances. Every once in a while pairs of agents manage to measure the relative offset between them. The challenge is for the agents to use these measurements and take actions so that in the end everyone gets into formation.

We discuss this problem in greater detail in Chapter 2, where we explain several formation control ideas originating in [86]. Averaging theorems can show the possibility of working formation control in this setting, subject only to very intermittent communication.

5. The task assignment problem consists in distributing a set of tasks among a collection of agents. This arises, for example, in the case of a group of aircraft or robots who would like to make decisions autonomously without communication with a common base. A typical example is dividing a list of locations to be monitored among a group of aircraft.

In such cases, various auction based methods are often used to allocate tasks. Averaging and consensus algorithms provide the means by which these auctions are implemented in a distributed way; we refer the reader to the papers [32, 99] for details.

6. Consider a collection of clocks which are constantly drifting apart. This is a common scenario, because clocks drift randomly depending on various factors within their environment (e.g. temperature), and also because clocks have a (nonzero) drift relative to the “true” time. Maintaining a common time as much as possible is important for a number of estimation problems (for example, direction of arrival problems).

An important problem is to design distributed protocols to keep the clocks synchronized. These try to keep clock drift to a minimum, at least between time periods when an outside source can inform each node of the correct time. This problem has a natural similarity to averaging, except that one does not care very much about getting the average right; rather, agreement on any time will do. Moreover, the constantly evolving times present a further challenge.

A natural approach, explored in some of the recent literature, is to adapt averaging techniques to work in this setting. We refer the reader to [90, 25, 12, 44, 104].

### 1.3 Main contributions

This thesis is devoted to the analysis of the convergence time of averaging schemes and to the degradation in performance as a result of quantized communication. What follows is a brief summary of our contributions by chapter.

Chapters 2 and 3 are introductory. We begin with Chapter 2, which seeks to motivate the practical use of averaging algorithms. We compare averaging algorithms to other ways of aggregating information, such as flooding and leader-election based methods, and discuss the various advantages and disadvantages. Our main point is that schemes based on distributed averaging possess two unique strengths: robustness to link failures and economical storage requirements at each node.

Next, in Chapter 3, we discuss the most elementary known results on the convergence of averaging methods. The rest of this thesis will be spent on improving and refining the basic results in this chapter.

In Chapter 4 we give an exposition of the first polynomial bound on the convergence time of averaging algorithms. Previously known bounds, such as those described in Chapter 3, were exponential in the number of nodes $n$ in the worst case. In the subsequent Chapter 5, we give an averaging algorithm whose convergence time scales as $O(n^2)$ steps on nearly arbitrary time-varying graph sequences. This is currently the best averaging algorithm in terms of convergence time bounds.

We next ask whether it is possible to design averaging algorithms which improve on this quadratic scaling. In Chapter 6, we prove that it is in fact impossible to beat the $n^2$ time steps bound within a large class of (possibly nonlinear) update schemes. The schemes we consider do not exhaust all possible averaging algorithms, but they do encompass the majority of averaging schemes proposed thus far in the literature.

We then move on to study the effect of quantized communication and storage. Chapter 7 gives a recipe for quantizing any linear averaging scheme. The quantization performs averaging while storing and transmitting only $c \log n$ bits. It is shown that this quantization preserves the convergence time bounds of the scheme, and moreover allows one to compute the average to any desired accuracy: by picking $c$ large (but not dependent on $n$), one can make the final result be as close to the average as desired.

In Chapter 8, we investigate whether it is possible to push the storage down even further; in particular, we show how the average may be approximately computed with a deterministic algorithm in which each node stores only a constant number of bits for every connection it maintains. An algorithm for fixed graphs is given; the dynamic graph case remains an open question.

Finally, Chapter 9 tackles the more general question: which functions can be computed with a decentralized algorithm which uses a constant number of bits per link? The chapter assumes a consensus-like termination requirement in which the nodes only have to get the right answer eventually, but are not required to know when they have done so. The main result is a nearly tight characterization of the functions which can be computed deterministically in this setting.

## Chapter 2 Why averaging?

Our goal in this chapter is to motivate the study of distributed averaging algorithms. We will describe two settings in which averaging turns out to be singularly useful. The first is an estimation problem in a sensor network setting; we will describe an averaging-based solution which avoids the pitfalls which plague alternative schemes. The second is a formation maintenance problem; we will show how basic theorems on averaging allow us to establish the satisfactory performance of some formation control schemes.

### 2.1 A motivating example: distributed estimation in sensor networks

Consider a large collection of sensors, $1, \ldots, n$, that want to estimate an unknown parameter $\theta$. Some of these sensors are able to measure a noise-corrupted version of $\theta$; in particular, all nodes $i$ in some subset $S \subseteq \{1, \ldots, n\}$ measure

$$x_i = \theta + w_i.$$

We will assume, for simplicity, that the noises $w_i$ are jointly Gaussian and independent at different sensors. Moreover, only node $i$ knows the statistics of its noise $w_i \sim N(0, \sigma_i^2)$.

It is easy to see that the maximum likelihood estimate is given by

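$$\hat{\theta} = \frac{\sum_{i \in S} x_i / \sigma_i^2}{\sum_{i \in S} 1 / \sigma_i^2}.$$

Note that if $S = \{1, \ldots, n\}$ (i.e., every node makes a measurement), and all the variances $\sigma_i^2$ are equal, the maximum likelihood estimate is just the average $\hat{\theta} = (1/n) \sum_{i=1}^n x_i$.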
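As a small numerical illustration of the maximum likelihood formula, the sketch below computes the inverse-variance weighted average for a few made-up measurements. The function name `ml_estimate` and the numbers are assumptions chosen for the example, not data from the thesis.

```python
def ml_estimate(measurements, variances):
    """Inverse-variance weighted average of the measurements."""
    numerator = sum(x / v for x, v in zip(measurements, variances))
    denominator = sum(1.0 / v for v in variances)
    return numerator / denominator

# Hypothetical measurements x_i and noise variances sigma_i^2.
x = [1.0, 2.0, 4.0]
sigma2 = [1.0, 1.0, 2.0]

# (1/1 + 2/1 + 4/2) / (1/1 + 1/1 + 1/2) = 5 / 2.5
theta_hat = ml_estimate(x, sigma2)

# With equal variances the estimate reduces to the plain average 7/3.
equal_var = ml_estimate(x, [3.0, 3.0, 3.0])
```

Noisier sensors receive proportionally less weight, and with equal variances the weights cancel, which is the special case in which the estimation problem becomes the averaging problem.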

The sensors would like to compute $\hat{\theta}$ in a distributed way. We do not assume the existence of a “fusion center” to which the sensors can transmit measurements; rather, the sensors have to compute the answer by exchanging messages with their neighbors and performing computations.

The sensors face an additional problem: there are communication links available through which they can exchange messages, but these links are unreliable. In particular, links fail and come online in unpredictable ways. For example, there is no guarantee that any link will come online if the sensors wait long enough. It is possible for a link to be online for some time, and then fail forever. Figure 2-1 shows an example of what may happen: at any given time, only some pairs of nodes may exchange messages, and the network is effectively split into disconnected clusters.

More concretely, we will assume a discrete sequence of times $t = 1, 2, 3, \ldots$ during which the sensors may exchange messages. At time $t$, sensor $i$ may send a message to its neighbors in the undirected graph $G(t) = (\{1, \ldots, n\}, E(t))$. We will also assume that the graph includes all self loops $(i,i)$. The problem is to devise good algorithms for computing $\hat{\theta}$, and to identify minimal connectivity assumptions on the sequence $G(t)$ under which such a computation is possible.

#### 2.1.1 Flooding

We now describe a possible answer. It is very plausible to make an additional assumption that sensors possess unique identifiers; this is the case in almost any wireless system. The sensors can use these identifiers to “flood” the network so that eventually, every sensor knows every single measurement that has been made.

At time $1$, sensor $i$ sends its own triplet $(i, x_i, \sigma_i^2)$ to each of its neighbors. Each sensor stores all the messages it has received. Moreover, a sensor maintains a “to broadcast” queue, and each time it hears a message with an id it has not heard before, it adds it to the tail of the queue. At times $t = 2, 3, \ldots$, each sensor broadcasts the top message from its queue.

If $G(t)$ is constant with time, and connected, then eventually each sensor learns all the measurements that have been made. Once that happens, each sensor has all the information it needs to compute the maximum likelihood estimate. Moreover, the sensors do not even need to try to detect whether they have learned everything; each sensor can simply maintain an estimate of $\hat{\theta}$, and revise that estimate each time it learns of a new measurement.

If $G(t)$ is not fixed, flooding can still be expected to work. Indeed, each time a link appears, there is opportunity for a piece of information to be learned. One can show that subject to only very minimal requirements on connectivity, every sensor eventually does learn every measurement.
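The behavior of flooding over a changing graph can be illustrated with a small simulation. This is a sketch under simplifying assumptions rather than the protocol exactly as described: instead of a one-message-per-step broadcast queue, every node sends everything it currently knows to its current neighbors. The function names `flood` and `estimate`, the four-node network, and the link schedule are all hypothetical.

```python
def flood(triplets, edge_sequence):
    """Simulate simplified flooding over a sequence of undirected edge lists.

    triplets maps a node id to its (measurement, variance) pair.
    Returns, for each node, the dictionary of triplets it has learned.
    """
    known = {i: {i: triplets[i]} for i in triplets}
    for edges in edge_sequence:
        # Messages sent in a step reflect knowledge at the start of the step.
        snapshot = {i: dict(k) for i, k in known.items()}
        for a, b in edges:
            known[a].update(snapshot[b])
            known[b].update(snapshot[a])
    return known

def estimate(known_triplets):
    """Maximum likelihood estimate from the triplets a node has learned so far."""
    num = sum(x / var for x, var in known_triplets.values())
    den = sum(1.0 / var for x, var in known_triplets.values())
    return num / den

# Hypothetical data: id -> (x_i, sigma_i^2).
triplets = {1: (1.0, 1.0), 2: (2.0, 1.0), 3: (4.0, 2.0), 4: (3.0, 1.0)}

# Only one link is up at any time, so no single graph is connected,
# but the union over each block of three steps is a path 1-2-3-4.
schedule = [[(1, 2)], [(3, 4)], [(2, 3)]] * 2

known = flood(triplets, schedule)
estimates = {i: estimate(k) for i, k in known.items()}
```

In this run every node ends up holding all four triplets even though the network is never connected at any single time step, and each node can revise its estimate whenever a new triplet arrives. Each node also ends up storing one entry per sensor in the network.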

Let us state a theorem to this effect. We will use the notation $\bigcup_{t \in X} G(t)$ to mean the graph obtained by forming the union of the edge sets of $G(t)$, $t \in X$, i.e.,

$$\bigcup_{t \in X} G(t) = \Big( \{1, \ldots, n\}, \bigcup_{t \in X} E(t) \Big).$$

A relatively light connectivity assumption is the following.

###### Assumption 2.1.

(Connectivity) The graph

$$\bigcup_{s \geq t} G(s)$$

is connected for every $t$.

In words, this assumption says that the graph sequence has enough edges for connectivity, and that moreover this remains true after any finite set of graphs is removed from the sequence.
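The connectivity assumption concerns infinite tails of the graph sequence, so it cannot be verified from a finite prefix in general. For a sequence that repeats periodically, however, it holds exactly when the union over one period is connected. The sketch below checks that finite condition; the function name `union_is_connected` and the example schedule are hypothetical.

```python
def union_is_connected(nodes, edge_sets):
    """True if the union of the given undirected edge sets connects all nodes."""
    adjacency = {i: set() for i in nodes}
    for edges in edge_sets:
        for a, b in edges:
            adjacency[a].add(b)
            adjacency[b].add(a)
    # Depth-first search from an arbitrary node of the union graph.
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adjacency)

# One period of a hypothetical schedule in which a single link is up at a time.
period = [[(1, 2)], [(3, 4)], [(2, 3)]]
whole_period_ok = union_is_connected([1, 2, 3, 4], period)
missing_link_ok = union_is_connected([1, 2, 3, 4], period[:2])
```

Here the full period yields a connected union, while dropping the link between nodes 2 and 3 leaves two clusters that never exchange information.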

###### Theorem 2.1.

If Assumption 2.1 holds, then under the flooding protocol every node eventually learns each triplet $(i, x_i, \sigma_i^2)$.

###### Proof.

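(Sketch). Suppose that some triplet $(i, x_i, \sigma_i^2)$ is never learned by some node $j$. Let $A$ be the nonempty set of nodes that do learn this triplet, so that $j \notin A$; one can easily argue that the number of edges in the graphs $G(t)$ between $A$ and its complement $A^c$ is finite. Hence, for all sufficiently large $t$, no edge of $\bigcup_{s \geq t} G(s)$ joins $A$ to $A^c$, and this graph is disconnected. But this contradicts Assumption 2.1. ∎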

It is possible to relax the assumption of this theorem: the union of the graphs actually only needs to be connected over a sufficiently long but finite time interval.

The problem with flooding, however, lies with its storage requirements: each sensor needs to store $n$ pieces of information, i.e., it needs to store a list of ids whose measurements it has already seen. This means that the total amount of storage throughout the network is at least on the order of $n^2$.

We count the storage requirements of the numbers $x_i, \sigma_i^2$ as constant. When dealing with estimation problems, it is convenient to assume that $x_i, \sigma_i^2$ are real numbers, and we will maintain this technical assumption for now. Nevertheless, in practice, these numbers will be truncated to some fixed number of bits (independent of $n$). Thus it makes sense to think of transmitting each of the $x_i, \sigma_i^2$ as incurring a fixed cost independent of $n$.

Thus, tracing out the dependence on $n$, we have that at least on the order of $n^2$ bits must be stored, in addition to a fixed number of bits at each node to maintain truncated versions of $x_i, \sigma_i^2$ and the estimate $\hat{\theta}$. One hopes for the existence of a scheme whose storage requirements scale more gracefully with the number of nodes $n$.

#### 2.1.2 Leader election based protocols

The protocol we outline next has considerably nicer storage requirements. On the other hand, it will require some stringent assumptions on connectivity. We describe it next for the case of a fixed communication graph, i.e., when the graph sequence $G(t)$ does not depend on $t$.

First, the sensors elect one of them as a leader. There are various protocols for doing this. If sensors have unique identifiers, they may pick (in a distributed way) the sensor with the largest or smallest id as the leader. Even in the absence of identifiers, there are randomized protocols for leader election (see [7]) which take on the order of the diameter of the network time steps, provided that messages of size $O(\log n)$ bits can be sent at each time step. (The diameter of the graph $G$, denoted by $d(G)$, is the largest distance between any two nodes.)

Next, the sensors can build a spanning tree with the leader as the root. For example, each sensor may pick as its parent the node one hop closer to the root. Finally, the sensors may forward all of their information (i.e., their $(x_i, \sigma_i^2)$) to the root, which can compute the answer and forward it back.

Our description of the algorithm is deliberately vague, as the details do not particularly matter (e.g., which leader election algorithm is used). We would like to mention, however, that it is even possible to avoid having the leader learn all the information $(x_i, \sigma_i^2)$. For example, once a spanning tree is in place, each node may wait to hear from all of its children, and then forward to the leader a sufficient statistic for the measurements in its subtree.
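The spanning-tree idea can be sketched on a fixed graph as follows. This is a centralized simulation written under stated assumptions, not a distributed implementation: the leader is simply taken to be the largest id, the tree is built by breadth-first search, and the pair of running sums (sum of measurements divided by variances, sum of inverse variances) is used as the sufficient statistic passed up the tree. The names `bfs_tree` and `aggregate` and the example network are hypothetical.

```python
from collections import deque

def bfs_tree(adjacency, root):
    """Breadth-first spanning tree: each node's parent is one hop closer to the root."""
    parent = {root: None}
    order = [root]
    frontier = deque([root])
    while frontier:
        u = frontier.popleft()
        for v in adjacency[u]:
            if v not in parent:
                parent[v] = u
                order.append(v)
                frontier.append(v)
    return parent, order

def aggregate(adjacency, root, measurements):
    """Pass sufficient statistics up the tree and return the estimate at the root."""
    parent, order = bfs_tree(adjacency, root)
    stat = {}
    for u in order:
        if u in measurements:
            x, var = measurements[u]
            stat[u] = [x / var, 1.0 / var]
        else:
            stat[u] = [0.0, 0.0]   # node without a measurement contributes nothing
    # Reverse breadth-first order visits every child before its parent.
    for u in reversed(order):
        p = parent[u]
        if p is not None:
            stat[p][0] += stat[u][0]
            stat[p][1] += stat[u][1]
    numerator, denominator = stat[root]
    return numerator / denominator

# Hypothetical path network 1-2-3-4; node 4 makes no measurement.
adjacency = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
measurements = {1: (1.0, 1.0), 2: (2.0, 1.0), 3: (4.0, 2.0)}

leader = max(adjacency)   # stand-in for a distributed election by largest id
theta_hat = aggregate(adjacency, leader, measurements)
```

Each node forwards only two numbers regardless of the size of its subtree, which is the source of the modest storage requirement. The construction depends on the tree staying intact while the statistics travel to the root.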

We state the existence of such protocols as a theorem:

###### Theorem 2.2.

If does not depend on time, i.e., for all , it is possible to compute with high probability in time steps. The nodes need to store and forward messages of size bits in each time step, as well as a constant number of real numbers which are smooth functions of the measurements .

Observe that the theorem allows for the existence of protocols which (only) work with high probability; this is due to the necessarily randomized nature of leader election protocols in the absence of identifiers.

The above theorem is nearly optimal for any fixed graph. In other words, if the communication links are reliable, there is no reason to choose anything but the above algorithm.

On the other hand, if the graph sequence is changing unpredictably, this sort of approach immediately runs into problems. Maintaining a spanning tree appears to be impossible if the graph sequence changes dramatically from step to step, and other approaches are needed.

### 2.2 Using distributed averaging

We now describe a scheme which is both robust to link failures (it only needs Assumption 2.1 to work) and has nice storage requirements (only requires nodes to maintain a constant number of real numbers which are smooth functions of the data ). However, it needs the additional assumption that the graphs are undirected. This condition is often satisfied in practice, for example if the sensors are connected by links whenever they are within a certain distance of each other.

First, let us introduce some notation. Let be the set of neighbors of node in the graph . Recall that we assume self-arcs are always present, i.e., for all , so that for all . Let be the degree of node in .

Let us first describe the scheme for the case where , i.e., every node makes a measurement, and all are the same. In this case, the maximum likelihood estimate is just the average of the numbers : .

The scheme is as follows. Each node sets and updates as

$$x_i(t+1) = \sum_{j \in N_i(t)} a_{ij}(t)\, x_j(t), \tag{2.1}$$

where

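$$a_{ij}(t) = \begin{cases} \min\left(\dfrac{1}{d_i(t)}, \dfrac{1}{d_j(t)}\right), & \text{for } j \in N_i(t),\ j \neq i, \\ 0, & \text{for } j \notin N_i(t), \end{cases} \qquad a_{ii}(t) = 1 - \sum_{j \in N_i(t),\, j \neq i} a_{ij}(t).$$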

Then:

###### Proposition 2.1.

If Assumption 2.1 holds and all the graphs are undirected, then

$$\lim_{t \to \infty} x_i(t) = \frac{1}{n} \sum_{j=1}^{n} x_j = \hat{\theta}.$$

The above proposition is true because it is a special case of the following theorem:

###### Theorem 2.3.

Consider the iteration

 x(t+1)=A(t)x(t)

where:

1. The matrices $A(t)$ are doubly stochastic. (A matrix is called doubly stochastic if it is nonnegative and all of its rows and columns add up to 1.)

2. If , then and , and if , then .

3. There is some such that if then .

4. The graph sequence is undirected.

5. Assumption 2.1 on the graph sequence holds.

Then:

$$\lim_{t \to \infty} x_i(t) = \frac{1}{n} \sum_{j=1}^{n} x_j(0),$$

for all .

We will prove this theorem in Chapter 3. In this form, this theorem is a trivial modification of the results in [70, 28, 76, 52, 20] which themselves are based on the earlier results in [95, 55].

Accepting the truth of this theorem for now, we can conclude that we have described a simple way to compute . Every node just needs to store and update a single real number . Nevertheless, subject to only the weak connectivity Assumption 2.1, every approaches the correct . This scheme thus manages to avoid the downsides that plague flooding and leader election (high storage, lack of robustness to link failures).
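A minimal simulation of the update of Eq. (2.1) may help make this concrete. It is a sketch under assumptions of our own choosing: six nodes on a ring, with only one link of the ring available at each time step (a hypothetical schedule), degrees that count the self-arc, and illustrative names such as `metropolis_step`.

```python
def metropolis_step(x, edges):
    """One step of Eq. (2.1) on an undirected graph given as a list of pairs (i, j)."""
    n = len(x)
    nbrs = [{i} for i in range(n)]        # self-arcs are always present
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    deg = [len(s) for s in nbrs]
    new_x = []
    for i in range(n):
        xi = x[i]
        for j in nbrs[i]:
            if j != i:
                a_ij = min(1.0 / deg[i], 1.0 / deg[j])
                xi += a_ij * (x[j] - x[i])   # equals a_ii * x_i + sum over j != i of a_ij * x_j
        new_x.append(xi)
    return new_x

n = 6
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
for t in range(1200):
    # only the link between nodes t mod n and (t + 1) mod n is up at time t
    x = metropolis_step(x, [(t % n, (t + 1) % n)])
print(x)   # every entry is close to the average 2.5
```

No single graph in this schedule is connected, yet every node approaches the average while storing a single number.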

Let us describe next how to use this idea in the general case where is a proper subset of and the are not all equal. Each node sets , , and . If the node did not make a measurement, it sets , and leaves undefined. Each node updates as

$$\begin{aligned} x_i(t+1) &= \sum_{j \in N_i(t)} a_{ij}(t)\, x_j(t), \\ y_i(t+1) &= \sum_{j \in N_i(t)} a_{ij}(t)\, y_j(t), \\ z_i(t+1) &= \frac{x_i(t)}{y_i(t)}. \end{aligned}$$

Observe that and we have:

###### Proposition 2.2.

If Assumption 2.1 holds, then

$$\lim_{t \to \infty} z_i(t) = \hat{\theta}.$$

###### Proof.

By Theorem 2.3,

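$$\begin{aligned} \lim_{t \to \infty} x_i(t) &= \frac{1}{n} \sum_{j \in S} \frac{x_j}{\sigma_j^2}, \\ \lim_{t \to \infty} y_i(t) &= \frac{1}{n} \sum_{j \in S} \frac{1}{\sigma_j^2}, \end{aligned}$$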

and so

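$$\lim_{t \to \infty} z_i(t) = \frac{\lim_{t \to \infty} x_i(t)}{\lim_{t \to \infty} y_i(t)} = \frac{\sum_{j \in S} x_j / \sigma_j^2}{\sum_{j \in S} 1 / \sigma_j^2} = \hat{\theta}.$$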

The punchline is that even this somewhat more general problem can be solved by an algorithm that relies on Theorem 2.3. This solution has nice storage requirements (nodes store only a constant number of real numbers which are smooth functions of ) and is robust to link failures.
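The general scheme can be sketched along the same lines. The initialization below is inferred from the limits in the proof and should be read as an assumption: a measuring node starts with its measurement divided by its variance and with its inverse variance, and a non-measuring node starts both quantities at zero. The data, the alternating graph schedule, and the function name are hypothetical.

```python
def metropolis_step(v, nbrs):
    """One averaging step; nbrs[i] lists the neighbors of i other than i itself."""
    deg = [len(s) + 1 for s in nbrs]      # the degree counts the self-arc
    return [v[i] + sum(min(1.0 / deg[i], 1.0 / deg[j]) * (v[j] - v[i]) for j in nbrs[i])
            for i in range(len(v))]

# (x_i, sigma_i^2) for nodes that measured, None otherwise
meas = [(1.0, 1.0), None, (3.0, 1.0), (5.0, 2.0)]
x = [m[0] / m[1] if m else 0.0 for m in meas]
y = [1.0 / m[1] if m else 0.0 for m in meas]

# two alternating undirected graphs on the path 0 - 1 - 2 - 3; neither is connected alone
graphs = [[[1], [0], [3], [2]], [[], [2], [1], []]]
for t in range(400):
    x = metropolis_step(x, graphs[t % 2])
    y = metropolis_step(y, graphs[t % 2])
z = [xi / yi for xi, yi in zip(x, y)]
print(z)   # each entry is close to 6.5 / 2.5 = 2.6
```

The common factor 1/n in the two limits cancels in the ratio, so no node needs to know the number of nodes.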

### 2.3 A second motivating example: formation control

We now give another application of Theorem 2.3, this time to a certain formation control problem. Our exposition is based on [86].

Suppose that the nodes have real positions in ; the initial positions are arbitrary, and the nodes want to move into a formation characterized by positions in (formations are defined up to translation). This formation is uniquely characterized by the offset vectors . We assume that at every , various pairs of nodes succeed in measuring the offsets . Let be the set of (undirected) edges corresponding to the measurements. The problem is how to use these intermittent measurements to get into the desired formation.

Let us assume for simplicity that our lie in (we will dispense with this assumption shortly). A very natural idea is to perform gradient descent on the function

$$\sum_{(i,j) \in E(t)} \bigl( x_i(t) - x_j(t) - r_{ij} \bigr)^2.$$

This leads to the following control law:

$$x_i(t+1) = x_i(t) - 2\Delta \sum_{j \in N_i(t)} \bigl( x_i(t) - x_j(t) \bigr) + 2\Delta \sum_{j \in N_i(t)} r_{ij}, \tag{2.2}$$

where is the stepsize. Essentially, every node repeatedly “looks around” and moves to a new position depending on the positions of its neighbors and its desired offset vectors .

Defining , the above equations may be rewritten as

 x(t+1)=A(t)x(t)+b(t).

Now let be any translate of the given positions . Then, satisfies

 z=A(t)z+b(t),

because the gradient at equals . Subtracting the two equations we get

 x(t+1)−z=A(t)(x(t)−z).

Observe that if , then the above matrix is nonnegative, symmetric, and has rows that add up to 1. Applying now Theorem 2.3, we get the following statement:

###### Proposition 2.3.

Suppose that the nodes implement the iteration of Eq. (2.2). If:

1. is the translate of whose average is the same as the average of .

2. The communication graph sequence satisfies Assumption 2.1 (connectivity).

Then,

$$\lim_{t \to \infty} x_i(t) = z'_i,$$

for all .

The proof is a straightforward application of Theorem 2.3. This theorem tells us that subject to only minimal conditions on connectivity, the scheme of Eq. (2.2) will converge to the formation in question.

We remark that we can replace the assumption that the are real numbers with the assumption that the belong to . In this case, we can apply the control law of Eq. (2.2), which decouples along each component of , and apply Proposition 2.3 to each component of .

Finally, we note that continuous-time versions of these updates may be presented; see [86].
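The discrete-time control law of Eq. (2.2) can be simulated in one dimension as follows. The sketch assumes offsets of the form r_ij = p_i - p_j (the form for which every translate of the target positions is a fixed point), a stepsize chosen by hand to be small enough for this example, and a hypothetical alternating measurement schedule; all names and numbers are illustrative.

```python
def formation_step(x, edges, p, delta):
    """One step of Eq. (2.2) in one dimension, with offsets r_ij = p_i - p_j."""
    new_x = list(x)
    for i, j in edges:                    # each undirected measurement is used by both endpoints
        r_ij = p[i] - p[j]
        new_x[i] += 2 * delta * (-(x[i] - x[j]) + r_ij)
        new_x[j] += 2 * delta * (-(x[j] - x[i]) - r_ij)
    return new_x

p = [0.0, 1.0, 2.0, 3.0]                  # desired formation
x = [5.0, -1.0, 2.0, 2.0]                 # arbitrary initial positions, average 2.0
delta = 1.0 / 16                          # stepsize, small enough for this example
schedule = [[(0, 1), (2, 3)], [(1, 2)]]   # alternating measurement sets
for t in range(2000):
    x = formation_step(x, schedule[t % 2], p, delta)
print(x)   # close to [0.5, 1.5, 2.5, 3.5], the translate of p with average 2.0
```

The two updates triggered by one measurement cancel in the sum, so the average position is preserved, which is why the limit is the translate with the same average as the initial positions.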

### 2.4 Concluding remarks

Our goal in this chapter has been to explain why averaging algorithms are useful. We have described averaging-based algorithms for estimation and formation problems which are robust to link failures and have very light storage requirements at each node.

Understanding the tradeoff between various types of distributed algorithms is still very much an open question, and the discussion in this chapter has only scratched the surface of it. One might additionally wonder how averaging algorithms perform on a variety of other dimensions: convergence time, energy expenditure, robustness to node failures, performance degradation with noise, and so on.

Most of this thesis will be dedicated to exploring the question of convergence time. In the next chapter we will give a basic introduction to averaging algorithms, and in particular we will furnish a proof of Theorem 2.3. In the subsequent chapters, we will turn to the question of designing averaging algorithms with good convergence time guarantees.

## Chapter 3 The basic convergence theorems

Here we begin our analysis of averaging algorithms by proving some basic convergence results. Our main goal is to prove Theorem 2.3 from the previous chapter, as well as some natural variants of it. Almost all of this thesis will be spent refining and improving the simple results obtained by elementary means in this chapter. The results we will present may be found in [94, 95, 17, 55, 70, 20]. We also recommend the paper [76] for a considerable generalization of the results found here. The material presented here appeared earlier in the paper [20] and the M.S. thesis [80].

### 3.1 Setup and assumptions

We consider a set of nodes, each starting with a real number stored in memory. The nodes attempt to compute the average of these numbers by broadcasting these numbers and repeatedly combining them by forming convex combinations. We will first only be concerned with the convergence of this process.

Each node starts with a scalar value . The vector with the values held by the nodes at time , is updated according to the equation , or

$$x_i(t+1) = \sum_{j=1}^{n} a_{ij}(t)\, x_j(t), \tag{3.1}$$

where is a nonnegative matrix with entries , and where the updates are carried out at some discrete set of times which we will take, for simplicity, to be the nonnegative integers. We will refer to this scheme as the agreement algorithm.

We will assume that the row-sums of are equal to 1, so that is a stochastic matrix. In particular, is a weighted average of the values held by the nodes at time . We are interested in conditions that guarantee the convergence of each to a constant, independent of .

Throughout, we assume the following.

###### Assumption 3.1 (non-vanishing weights).

The matrix is nonnegative, stochastic, and has positive diagonal. Moreover, there exists some such that if then .

Intuitively, whenever , node communicates its current value to node . Each node updates its own value by forming a weighted average of its own value and the values it has just received from other nodes.

The communication pattern at each time step can be described in terms of a directed graph , where if and only if . Note that for all since has positive diagonal. A minimal assumption is that starting at an arbitrary time , and for any , , there is a sequence of communications through which node will influence (directly or indirectly) the value held by node . This is Assumption 2.1 from the previous chapter.

We note various special cases of possible interest.

Fixed coefficients: There is a fixed matrix , with entries such that, for each , and for each , we have (depending on whether there is a communication from to at that time). This is the case presented in [17].

Symmetric model: If then . That is, whenever communicates to , there is a simultaneous communication from to .

Equal neighbor model: Here,

$$a_{ij}(t) = \begin{cases} 1/d_i(t), & \text{if } j \in N_i(t), \\ 0, & \text{if } j \notin N_i(t). \end{cases}$$

This model is a linear version of a model considered by Vicsek et al. [97]. Note that here the constant of Assumption 3.1 is equal to .

Metropolis model: Here,

$$a_{ij}(t) = \begin{cases} 1/\max\bigl(d_i(t), d_j(t)\bigr), & \text{if } j \in N_i(t),\ i \neq j, \\ 0, & \text{if } j \notin N_i(t), \end{cases}$$

and

$$a_{ii}(t) = 1 - \sum_{j=1,\, j \neq i}^{n} a_{ij}(t).$$

The Metropolis model is similar to the equal-neighbor model, but has the advantage of symmetry: .

Pairwise averaging model ([23]): This is the special case of both the symmetric model and of the equal neighbor model in which, at each time, there is a set of disjoint pairs of nodes who communicate with each other. If communicates with , then . Note that the sum is conserved; therefore, if consensus is reached, it has to be on the average of the initial values of the nodes.
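To compare the equal neighbor and Metropolis models on a concrete graph, one can build both matrices and inspect their row and column sums. The sketch below uses exact rational arithmetic and a star graph on three nodes; it assumes that the degree counts the self-arc (so that the equal-neighbor rows add up to 1), and the function names are hypothetical.

```python
from fractions import Fraction

def neighbor_sets(n, edges):
    nbrs = [{i} for i in range(n)]        # self-arcs are always present
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    return nbrs

def equal_neighbor(n, edges):
    nbrs = neighbor_sets(n, edges)
    return [[Fraction(1, len(nbrs[i])) if j in nbrs[i] else Fraction(0) for j in range(n)]
            for i in range(n)]

def metropolis(n, edges):
    nbrs = neighbor_sets(n, edges)
    A = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in nbrs[i] - {i}:
            A[i][j] = Fraction(1, max(len(nbrs[i]), len(nbrs[j])))
        A[i][i] = 1 - sum(A[i])           # the diagonal absorbs the remaining weight
    return A

def column_sums(A):
    return [sum(row[j] for row in A) for j in range(len(A))]

star = [(0, 1), (0, 2)]                   # node 0 is linked to nodes 1 and 2
E = equal_neighbor(3, star)
M = metropolis(3, star)
print(column_sums(E))                     # not all equal to 1
print(column_sums(M))                     # all equal to 1
```

Both matrices are row-stochastic, but only the Metropolis matrix is symmetric and hence doubly stochastic, which is what preserves the average.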

The assumption below is a strengthening of Assumption 2.1 on connectivity. We will see that it is sometimes necessary for convergence.

###### Assumption 3.2 (B-connectivity).

There exists an integer such that the directed graph

$$\bigl(N,\ E(kB) \cup E(kB+1) \cup \cdots \cup E((k+1)B-1)\bigr)$$

is strongly connected for all integer .
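On a finite prefix of a graph sequence, the condition of Assumption 3.2 can be checked directly by taking the union of the arc sets over each window of length B and testing strong connectivity. The following sketch does this with two depth-first searches per window; the arc schedule and the function names are hypothetical, and a finite check of this kind can only refute B-connectivity or confirm it on the windows examined.

```python
def strongly_connected(n, arcs):
    def reaches_all(adj):
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n
    fwd = [[] for _ in range(n)]
    rev = [[] for _ in range(n)]
    for u, v in arcs:
        fwd[u].append(v)
        rev[v].append(u)
    # node 0 must reach every node, and every node must reach node 0
    return reaches_all(fwd) and reaches_all(rev)

def b_connected_on_prefix(n, arc_seq, B):
    """Check the windows [kB, (k+1)B - 1] that fit inside the finite prefix arc_seq."""
    for k in range(len(arc_seq) // B):
        union = [a for t in range(k * B, (k + 1) * B) for a in arc_seq[t]]
        if not strongly_connected(n, union):
            return False
    return True

# one arc per time step, cycling through 0->1, 1->2, 2->0
seq = [[(0, 1)], [(1, 2)], [(2, 0)]] * 4
print(b_connected_on_prefix(3, seq, 3), b_connected_on_prefix(3, seq, 2))
```

In this schedule every window of length 3 contains the full directed cycle, whereas windows of length 2 do not.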

### 3.2 Convergence results in the absence of delays.

We say that the agreement algorithm guarantees asymptotic consensus if the following holds: for every , and for every sequence allowed by whatever assumptions have been placed, there exists some such that , for all .

###### Theorem 3.1.

Under Assumptions 3.1 (non-vanishing weights) and 3.2 (-connectivity), the agreement algorithm guarantees asymptotic consensus.

Theorem 3.1 may be found in [55]; a slightly different version is in [95, 94]. We next give an informal account of its proof.

###### Sketch of proof.

The proof has several steps.

Step 1: Let us define the notion of a path in the time-varying graph . A path from to of length starting at time is a sequence of edges such that , , and and so on. We will use to denote the product

$$c(p) = \prod_{i=0}^{l-1} a_{k_i, k_{i+1}}.$$

Define

$$\Phi(t_1, t_2) = A(t_2 - 1)\, A(t_2 - 2) \cdots A(t_1),$$

and let the ’th entry of this matrix be denoted by . The following fact can be established by induction:

$$\phi_{i,j}(t_1, t_2) = \sum_{\substack{\text{paths } p \text{ from } i \text{ to } j \\ \text{of length } t_2 - t_1 \text{ starting at time } t_1}} c(p).$$

A consequence is that if , then Assumption 3.1 implies .

Step 2: Assumptions 3.1 and 3.2 have the following implication: for any two nodes , there is a path of length that begins at and ends at .

This implication may be proven by induction on the following statement: for any node there are at least distinct nodes such that there is a path of length from to . The proof crucially relies on Assumption 3.1 which implies that all the self loops belong to every edge set .

Step 3: Putting Steps 1 and 2 together, we get that is a matrix whose every entry is bounded below by . The final step is to argue that

$$Q_n Q_{n-1} \cdots Q_1 x$$

converges to a multiple of the all-ones vector 1 for any sequence of matrices having this property and any initial vector . This is true because for any such matrix ,

$$\max_k (Q_i x)_k - \min_k (Q_i x)_k \le \bigl(1 - \eta^{nB}\bigr) \Bigl( \max_k x_k - \min_k x_k \Bigr).$$
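The contraction in Step 3 can be checked numerically on a small example. In the sketch below, the matrix `Q` and the vector `x` are hypothetical, and `alpha` denotes the smallest entry of `Q`, playing the role of the entrywise lower bound from Steps 1 and 2. Since every row places weight at least `alpha` on both the largest and the smallest entry of the input, the spread between the largest and smallest output entries shrinks by a factor of at most `1 - alpha`.

```python
def spread(v):
    return max(v) - min(v)

def apply(Q, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(q * xj for q, xj in zip(row, x)) for row in Q]

Q = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]                     # stochastic, with every entry positive
alpha = min(min(row) for row in Q)        # smallest entry of Q
x = [0.0, 1.0, 5.0]

y = apply(Q, x)
print(spread(x), spread(y))               # the spread shrinks after one multiplication

v = list(x)
for _ in range(200):                      # repeated multiplication drives the spread to zero
    v = apply(Q, v)
print(v)                                  # all entries are nearly equal
```

Repeating the argument over successive blocks of the time-varying product gives geometric decay of the spread, which is the convergence claimed in the sketch of proof.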

In the absence of -connectivity, the algorithm does not guarantee asymptotic consensus, as shown by Example 1 below (Exercise 3.1, in p. 517 of [17]). In particular, convergence to consensus fails even in the special case of the equal neighbor model. The main idea is that the agreement algorithm can closely emulate a nonconvergent algorithm that keeps executing the three instructions , , , one after the other.

Example 1. Let , and suppose that . Let be a small positive constant. Consider the following sequence of events. Node 3 communicates to node 1; node 1 forms the average of its own value and the received value. This is repeated times, where is large enough so that . Thus, . We now let node 2 communicate to node 3, times, where is large enough so that . In particular, . We now repeat the above two processes infinitely many times. During the th repetition, is replaced by (and get adjusted accordingly). Furthermore, by permuting the nodes at each repetition, we can ensure that Assumption 2.1 is satisfied. After repetitions, it can be checked that will be within of a unit vector. Thus, if we choose the so that , asymptotic consensus will not be obtained.

On the other hand, in the presence of symmetry, the -connectivity Assumption 3.2 is unnecessary. This result is proved in [70] and [28] for the special case of the symmetric equal neighbor model and in [76, 52], for the more general symmetric model. A more general result will be established in Theorem 3.4 below.

###### Theorem 3.2.

Under Assumptions 2.1 and 3.1, and for the symmetric model, the agreement algorithm guarantees asymptotic consensus.

### 3.3 Convergence in the presence of delays.

The model considered so far assumes that messages from one node to another are immediately delivered. However, in a distributed environment, and in the presence of communication delays, it is conceivable that a node will end up averaging its own value with an outdated value of another node. A situation of this type falls within the framework of distributed asynchronous computation developed in [17].

Communication delays are incorporated into the model as follows: when node , at time , uses the value from another node, that value is not necessarily the most recent one, , but rather an outdated one, , where , and where represents communication and possibly other types of delay. In particular, is updated according to the following formula:

$$x_i(t+1) = \sum_{j=1}^{n} a_{ij}(t)\, x_j\bigl(\tau_{ij}(t)\bigr). \tag{3.2}$$

We make the following assumption on the .

###### Assumption 3.3.

(Bounded delays) (a) If , then .
(b) , for all , .
(c) There exists some such that , for all , , .

Assumption 3.3(a) is just a convention: when , the value of has no effect on the update. Assumption 3.3(b) is quite natural, since a node generally has access to its own most recent value. Assumption 3.3(c) requires delays to be bounded by some constant .

The next result, from [94, 95], is a generalization of Theorem 3.1. The idea of the proof is similar to the one outlined for Theorem 3.1, except that we now define and . For convenience, we will adopt the definition that for all negative . Once more, one shows that the difference decreases by a constant factor after a bounded amount of time.

###### Theorem 3.3.

Under Assumptions 3.1, 3.2, 3.3 (non-vanishing weights, bounded intercommunication intervals, and bounded delays), the agreement algorithm with delays [cf. Eq. (3.2)] guarantees asymptotic consensus.
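The update of Eq. (3.2) can be simulated by keeping the history of past values. The sketch below is an illustration under assumptions of our own: a fixed directed ring on three nodes with equal-neighbor weights, a hypothetical delay pattern with delays of at most two steps, no delay on a node's own value, and the convention that values at negative times equal the initial values.

```python
def delayed_agreement(x0, A, delay, steps):
    """Iterate Eq. (3.2); delay(i, j, t) is the age of the value of node j used by node i."""
    n = len(x0)
    hist = [list(x0)]
    for t in range(steps):
        new = []
        for i in range(n):
            total = 0.0
            for j in range(n):
                if A[i][j] > 0:
                    tau = max(0, t - delay(i, j, t))   # fall back to x(0) for negative times
                    total += A[i][j] * hist[tau][j]
            new.append(total)
        hist.append(new)
    return hist[-1]

A = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]                     # directed ring with self-arcs

def delay(i, j, t):
    return 0 if i == j else (i + j + t) % 3   # at most 2 steps old, own value always current

x0 = [0.0, 3.0, 9.0]
final = delayed_agreement(x0, A, delay, 600)
print(final)   # the three entries nearly coincide
```

The common limit lies between the smallest and largest initial values, but with delays it need not be their average.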

Theorem 3.3 assumes bounded intercommunication intervals and bounded delays. The example that follows (Example 1.2, in p. 485 of [17]) shows that Assumption 3.3(c) (bounded delays) cannot be relaxed. This is the case even for a symmetric model, or the further special case where has exactly four arcs , , , and at any given time , and these satisfy , as in the pairwise averaging model.

Example 2. We have two nodes who initially hold the values and , respectively. Let be an increasing sequence of times, with and . If , the nodes update according to

$$\begin{aligned} x_1(t+1) &= \bigl( x_1(t) + x_2(t_k) \bigr)/2, \\ x_2(t+1) &= \bigl( x_1(t_k) + x_2(t) \bigr)/2. \end{aligned}$$

We will then have and , where can be made arbitrarily small, by choosing large enough. More generally, between time and , the absolute difference contracts by a factor of , where the corresponding contraction factors approach 1. If the are chosen so that , then , and the disagreement does not converge to zero.
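Example 2 is easy to reproduce numerically. The sketch below assumes initial values 0 and 1 and a hypothetical choice of block lengths (the number of steps between consecutive times in the increasing sequence). Within a block of m steps that reuses the stale values from the start of the block, the absolute difference is multiplied by 1 - 2^(1-m), so blocks of growing length give contraction factors whose product stays bounded away from zero.

```python
def run_example(block_lengths, x1=0.0, x2=1.0):
    """Two-node update in which each block reuses the values held at the start of the block."""
    for m in block_lengths:
        c1, c2 = x1, x2                   # stale values x1(t_k), x2(t_k)
        for _ in range(m):
            x1, x2 = (x1 + c2) / 2, (x2 + c1) / 2
    return x1, x2

growing = [k + 2 for k in range(40)]      # delays grow without bound
bounded = [2] * 40                        # delays stay bounded

a1, a2 = run_example(growing)
b1, b2 = run_example(bounded)
print(abs(a1 - a2))                       # stays away from zero
print(abs(b1 - b2))                       # essentially zero
```

With bounded block lengths the same iteration reaches agreement, which matches the role of the bounded delay assumption in Theorem 3.3.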

According to the preceding example, the assumption of bounded delays cannot be relaxed. On the other hand, the assumption of bounded intercommunication intervals can be relaxed, in the presence of symmetry, leading to the following generalization of Theorem 3.2.

###### Theorem 3.4.

Under Assumptions 2.1 (connectivity), 3.1 (non-vanishing weights), and 3.3 (bounded delays), and for the symmetric model, the agreement algorithm with delays [cf. Eq. (3.2)] guarantees asymptotic consensus.

###### Proof.

Let

$$\begin{aligned} M_i(t) &= \max\{x_i(t), x_i(t-1), \ldots, x_i(t-B+1)\}, \\ M(t) &= \max_i M_i(t), \\ m_i(t) &= \min\{x_i(t), x_i(t-1), \ldots, x_i(t-B+1)\}, \\ m(t) &= \min_i m_i(t). \end{aligned}$$

Recall that we are using the convention that for all negative . An easy inductive argument, as in p. 512 of [17], shows that the sequences and are nondecreasing and nonincreasing, respectively. The convergence proof rests on the following lemma.

###### Lemma 3.1.

If and , then there exists a time such that .

Given Lemma 3.1, the convergence proof is completed as follows. Using the linearity of the algorithm, there exists a time such that . By applying Lemma 3.1, with replaced by , and using induction, we see that for every there exists a time such that , which converges to zero. This, together with the monotonicity properties of and , implies that and converge to a common limit, which is equivalent to asymptotic consensus. ∎

###### Proof of Lemma 3.1.

For , we say that “Property holds at time ” if there exist at least indices for which .

We assume, without loss of generality, that and . Then, for all by the monotonicity of . Furthermore, there exists some and some such that . Using the inequality , we obtain . This shows that there exists a time at which property holds.

We continue inductively. Suppose that and that Property holds at some time . Let be a set of cardinality containing indices for which , and let be the complement of . Let be the first time, greater than or equal to , at which , for some and (i.e., a node in gets to influence the value of a node in ). Such a time exists by the connectivity assumption (Assumption 2.1).

Note that between times and , the nodes in the set only form convex combinations between the values of the nodes in the set (this is a consequence of the symmetry assumption). Since all of these values are bounded below by , it follows that this lower bound remains in effect, and that , for all .

For times , and for every , we have , which implies that , for . Therefore, , for all .

Consider now a node for which . We have

$$x_i(t'+1) \ge a_{ij}(t')\, x_j\bigl(\tau_{ij}(t')\bigr) \ge \eta\, m_j(t') \ge \eta^{kB+1}.$$

Using also the fact , we obtain that . Therefore, at time , we have nodes with (namely, the nodes in , together with node ). It follows that Property is satisfied at time .

This inductive argument shows that there is a time at which Property is satisfied. At that time for all , which implies that . On the other hand, , which proves that . ∎

Now that we have proved Theorem 3.4, let us give a variation of it which will ensure not only convergence but convergence to the average.

###### Assumption 3.4.

(Double stochasticity) The matrix is column-stochastic for all , i.e.,

$$\sum_{i=1}^{n} a_{ij}(t) = 1,$$

for all and .

Note that Assumption 3.1 ensures that the matrix is only row-stochastic. The above assumption together with Assumption 3.1 ensures that is actually doubly stochastic.

###### Theorem 3.5.

Under Assumptions 2.1 (connectivity), 3.1 (non-vanishing weights), and 3.4 (double stochasticity), and for the symmetric model, the agreement algorithm (without delays) satisfies

$$\lim_{t \to \infty} x_i(t) = \frac{1}{n} \sum_{j=1}^{n} x_j(0).$$

###### Proof.

By Theorem 3.4, every converges to the same value. Assumption 3.4 ensures that is preserved from iteration to iteration:

$$\sum_{i=1}^{n} x_i(t+1) = \mathbf{1}^T x(t+1) = \mathbf{1}^T A(t)\, x(t) = \mathbf{1}^T x(t) = \sum_{i=1}^{n} x_i(t),$$

where we use the double stochasticity of to conclude that . This immediately implies that the final limit is the average of the initial values. ∎

Finally, let us observe that Theorem 2.3 from the previous chapter is a special case of the theorem we have just proved.

###### Proof of Theorem 2.3.

Observe that the assumptions of Theorem 3.5 are present among the assumptions of Theorem 2.3, except for Assumption 3.4 which needs to be verified. To argue that the matrices in Theorem 2.3 are doubly stochastic, we just observe that they are symmetric and stochastic. ∎

### 3.4 Relaxing symmetry

The symmetry condition [ iff ] used in Theorem 3.4 is somewhat unnatural in the presence of communication delays, as it requires perfect synchronization of the update times. A looser and more natural assumption is the following.

###### Assumption 3.5 (Bounded round-trip times).

There exists some such that whenever , then there exists some that satisfies and .

Assumption 3.5 allows for protocols such as the following. Node sends its value to node . Node responds by sending its own value to node . Both nodes update their values (taking into account the received messages), within a bounded time from receiving the other node’s value. In a realistic setting, with unreliable communications, even this loose symmetry condition may be impossible to enforce with absolute certainty. One can imagine more complicated protocols based on an exchange of acknowledgments, but fundamental obstacles remain (see the discussion of the “two-army problem” in pp. 32-34 of [15]). A more realistic model would introduce a positive probability that some of the updates are never carried out. (A simple possibility is to assume that each , with , is changed to a zero, independently, and with a fixed probability.) The convergence result that follows remains valid in such a probabilistic setting (with probability 1). Since no essential new insights are provided, we only sketch a proof for the deterministic case.

###### Theorem 3.6.

Under Assumptions 2.1 (connectivity), 3.1 (non-vanishing weights), 3.3 (delays) and 3.5 (bounded round-trip times) the agreement algorithm with delays [cf. Eq. (3.2)] guarantees asymptotic consensus.

###### Proof outline.

A minor change is needed in the proof of Lemma 3.1. In particular, we define as the event that there exist at least indices for which . It follows that holds at time .

By induction, let hold at time , and let be the set of cardinality containing indices for which . Furthermore, let be the first time after time that where exactly one of is in . Along the same lines as in the proof of Lemma 3.1, for ; since , it follows that for each . By our assumptions, exactly one of , is in . If , then and consequently . If , then there must exist a time with . It follows that:

 mj(τ+2B) ≥ ητ+2