# Distributed Decision-Making over Adaptive Networks

Sheng-Yuan Tu, and Ali H. Sayed, Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org. This work was supported in part by NSF grant CCF-1011918. An earlier conference version of this work appeared in [1]. The authors are with the Department of Electrical Engineering, University of California, Los Angeles (e-mail: shinetu@ee.ucla.edu; sayed@ee.ucla.edu).
###### Abstract

In distributed processing, agents generally collect data generated by the same underlying unknown model (represented by a vector of parameters) and then solve an estimation or inference task cooperatively. In this paper, we consider the situation in which the data observed by the agents may have arisen from two different models. Agents do not know beforehand which model accounts for their data and the data of their neighbors. The objective for the network is for all agents to reach agreement on which model to track and to estimate this model cooperatively. In these situations, where agents are subject to data from unknown different sources, conventional distributed estimation strategies would lead to biased estimates relative to any of the underlying models. We first show how to modify existing strategies to guarantee unbiasedness. We then develop a classification scheme for the agents to identify the models that generated the data, and propose a procedure by which the entire network can be made to converge towards the same model through a collaborative decision-making process. The resulting algorithm is applied to model fish foraging behavior in the presence of two food sources.


## I Introduction

Self-organization is a remarkable property of biological networks [2, 3], where various forms of complex behavior are evident and result from decentralized interactions among agents with limited capabilities. One example of sophisticated behavior is the group decision-making process by animals [4]. For example, it is common for biological networks to encounter situations where agents need to decide between multiple options, such as fish deciding between following one food source or another [5], and bees or ants deciding between moving towards a new hive or another [6, 7]. Although multiple options may be available, the agents are still able to reach agreement in a decentralized manner and move towards a common destination (e.g., [8]).

In previous works, we proposed and studied several diffusion strategies [9, 10, 11, 12, 13, 14] that allow agents to adapt and learn through a process of in-network collaboration and learning. References [13, 14] provide overviews of diffusion techniques and their application to distributed adaptation, learning, and optimization over networks. Examples of further applications and studies appear, e.g., in [15, 16, 17, 18, 19, 20]. Diffusion networks consist of a collection of adaptive agents that are able to respond to excitations in real-time. Compared with the class of consensus strategies [21, 22, 23, 24, 25, 26, 27], diffusion networks have been shown to remain stable irrespective of the network topology, while consensus networks can become unstable even when each agent is individually stable [28]. Diffusion strategies have also been shown to lead to improved convergence rate and superior mean-square-error performance [28, 14]. For these reasons, we focus in the remainder of this paper on the use of diffusion strategies for decentralized decision-making.

Motivated by the behavior of biological networks, we study distributed decision-making over networks where agents are subject to data arising from two different models. The agents do not know beforehand which model accounts for their data and the data of their neighbors. The objective of the network is for all agents to reach agreement on one model and to estimate and track this common model cooperatively. The task of reaching agreement over a network of agents subjected to different models is more challenging than earlier works on inference under a single data model. The difficulty is due to various reasons. First, traditional (consensus and diffusion) strategies will converge to a biased solution (see Eq. (14)). We therefore need a mechanism to compensate for the bias. Second, each agent now needs to distinguish between which model each of its neighbors is collecting data from (this is called the observed model) and which model the network is evolving to (this is called the desired model). In other words, in addition to the learning and adaptation process for tracking, the agents should be equipped with a classification scheme to distinguish between the observed and desired models. The agents also need to be endowed with a decision process to agree among themselves on a common (desired) model to track. Moreover, the classification scheme and the decision-making process will need to be implemented in a fully distributed manner and in real-time, alongside the adaptation process.

There have been useful prior works in the literature on formations over multi-agent networks [29, 30, 31, 32, 33, 34, 35] and opinion formation over social networks [36, 37, 38] using, for example, consensus strategies. These earlier works are mainly interested in having the agents reach an average consensus state, whereas in our problem formulation agents will need to reach one of the models and not the average of both models. Another difference between this work and the earlier efforts is our focus on combining real-time classification, decision-making, and adaptation into a single integrated framework running at each agent. To do so, we need to show how the distributed strategy should be modified to remove the bias that would arise due to the multiplicity of models — without this step, the combined decision-making and adaptation scheme will not perform as required. In addition, in our formulation, the agents need to continuously adjust their decisions and their estimates because the models are allowed to change over time. In this way, reaching a static consensus is not the objective of the network. Instead, the agents need to continuously adjust and track in a dynamic environment where decisions and estimates evolve with time as necessary. Diffusion strategies endow networks with such tracking abilities — see, e.g., Sec. VII of [39], where it is shown how well these strategies track as a function of the level of non-stationarity in the underlying models.

## II Diffusion Strategy

Consider a collection of $N$ agents (or nodes) distributed over a geographic region. The set of neighbors (i.e., the neighborhood) of node $k$ is denoted by $\mathcal{N}_k$; the number of nodes in $\mathcal{N}_k$ is denoted by $n_k$. At every time instant $i$, each node $k$ is able to observe realizations $\{d_k(i), u_{k,i}\}$ of a scalar random process $\boldsymbol{d}_k(i)$ and a $1 \times M$ row random regressor $\boldsymbol{u}_{k,i}$ with a positive-definite covariance matrix $R_{u,k} = \mathbb{E}\,\boldsymbol{u}_{k,i}^T\boldsymbol{u}_{k,i} > 0$. The regressors are assumed to be temporally white and spatially independent, i.e., $\mathbb{E}\,\boldsymbol{u}_{k,i}^T\boldsymbol{u}_{l,j} = R_{u,k}\,\delta_{k,l}\,\delta_{i,j}$ in terms of the Kronecker delta function. Note that we are denoting random quantities by boldface letters and their realizations or deterministic quantities by normal letters. The data collected at node $k$ are assumed to originate from one of two unknown column vectors $\{w_0^\circ, w_1^\circ\}$ of size $M$ in the following manner. We denote the generic observed model at node $k$ by $z_k^\circ \in \{w_0^\circ, w_1^\circ\}$; node $k$ does not know beforehand its observed model. The data at node $k$ are related to its observed model via a linear regression model of the form:

$$d_k(i) = u_{k,i}\, z_k^\circ + v_k(i) \tag{1}$$

where $v_k(i)$ is measurement noise with variance $\sigma_{v,k}^2$, assumed to be temporally white and spatially independent. The noise $\boldsymbol{v}_k(i)$ is assumed to be independent of $\boldsymbol{u}_{l,j}$ for all $l$ and $j$. All random processes are zero mean.

The objective of the network is to have all agents converge to an estimate for one of the models. For example, if the models happen to represent the location of food sources [40, 13], then this agreement will make the agents move towards one particular food source in lieu of the other source. More specifically, let $w_{k,i}$ denote the estimate at node $k$ at time $i$. The network would like to reach agreement on a common index $q$, such that

$$w_{k,i} \to w_q^\circ \quad \text{for } q = 0 \text{ or } q = 1 \text{ and for all } k \text{ as } i \to \infty \tag{2}$$

where convergence is in some desirable sense (such as the mean-square-error sense).

Several adaptive diffusion strategies for distributed estimation under a common model scenario were proposed and studied in [9, 11, 10, 12, 13], following the developments in [41, 42, 43, 44, 45] — overviews of these results appear in [13, 14]. One such scheme is the adapt-then-combine (ATC) diffusion strategy [45, 11]. It operates as follows. We select an $N \times N$ matrix $A$ with nonnegative entries $\{a_{l,k}\}$ satisfying:

$$\mathbb{1}_N^T A = \mathbb{1}_N^T \quad \text{and} \quad a_{l,k} = 0 \ \text{ if } l \notin \mathcal{N}_k \tag{3}$$

where $\mathbb{1}_N$ is the vector of size $N$ with all entries equal to one. The entry $a_{l,k}$ denotes the weight that node $k$ assigns to data arriving from node $l$ (see Fig. 1). The ATC diffusion strategy updates $w_{k,i-1}$ to $w_{k,i}$ as follows:

$$\psi_{k,i} = w_{k,i-1} + \mu_k\, u_{k,i}^T\big[d_k(i) - u_{k,i}\, w_{k,i-1}\big] \tag{4}$$

$$w_{k,i} = \sum_{l \in \mathcal{N}_k} a_{l,k}\, \psi_{l,i} \tag{5}$$

where $\mu_k$ is the constant positive step-size used by node $k$. The first step (4) involves local adaptation, where node $k$ uses its own data $\{d_k(i), u_{k,i}\}$ to update the weight estimate at node $k$ from $w_{k,i-1}$ to an intermediate value $\psi_{k,i}$. The second step (5) is a combination step where the intermediate estimates $\{\psi_{l,i}\}$ from the neighborhood of node $k$ are combined through the weights $\{a_{l,k}\}$ to obtain the updated weight estimate $w_{k,i}$. Such diffusion strategies have found applications in several domains including distributed optimization, adaptation, learning, and the modeling of biological networks — see, e.g., [13, 14, 40] and the references therein. Diffusion strategies were also used in some recent works [46, 47, 48, 49], albeit with diminishing step-sizes ($\mu_k(i) \to 0$) to enforce consensus among nodes. However, decaying step-sizes disable adaptation once they approach zero. Constant step-sizes are used in (4)-(5) to enable continuous adaptation and learning, which is critical for the application under study in this work.
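As a concrete illustration, the ATC recursion (4)-(5) can be sketched in a few lines of Python. The topology, step-size, and noise level below are illustrative assumptions rather than values from the paper, and all agents here observe a single common model, which is the setting (4)-(5) was designed for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and parameters (assumptions, not from the paper).
N, M = 10, 2          # number of agents, model dimension
mu = 0.01             # constant step-size mu_k = mu
sigma_v = 0.1         # measurement-noise standard deviation
w_true = rng.standard_normal(M)   # common model w° (single-model setting)

# Random symmetric topology with self-loops.
adj = np.eye(N, dtype=bool)
for k in range(N):
    adj[k, rng.choice(N, size=2)] = True
adj |= adj.T

# Left-stochastic combination matrix A per Eq. (3): columns sum to one
# and a_{l,k} = 0 whenever l is not a neighbor of k.
A = adj / adj.sum(axis=0, keepdims=True)

w = np.zeros((N, M))  # estimates w_{k,i-1}
for i in range(2000):
    psi = np.empty_like(w)
    for k in range(N):
        # Step (4): adapt using the local data {d_k(i), u_{k,i}}.
        u = rng.standard_normal(M)              # row regressor u_{k,i}
        d = u @ w_true + sigma_v * rng.standard_normal()
        psi[k] = w[k] + mu * u * (d - u @ w[k])
    # Step (5): combine the neighborhood intermediate estimates.
    w = A.T @ psi

err = np.linalg.norm(w - w_true, axis=1).max()
```

With a constant step-size, every agent keeps adapting indefinitely; the worst-case estimation error `err` settles at a small residual dictated by the step-size and noise level.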

When the data arriving at the nodes could have arisen from one model or the other, the distributed strategy (4)-(5) will not be able to achieve agreement as in (2), and the resulting weight estimates will tend towards a biased value. We first explain how this degradation arises and subsequently explain how it can be remedied.

###### Assumption 1 (Strongly connected network).

The network topology is assumed to be strongly connected so that the corresponding combination matrix $A$ is primitive, i.e., there exists an integer power $j > 0$ such that $[A^j]_{l,k} > 0$ for all $l$ and $k$.

As explained in [13], Assumption 1 amounts to requiring the network to be connected (where a path with nonzero weights exists between any two nodes), and for at least one node to have a non-trivial self-loop (i.e., $a_{k,k} > 0$ for at least one $k$). We conclude from the Perron-Frobenius Theorem [50, 51] that every primitive left-stochastic matrix $A$ has a unique eigenvalue at one while all other eigenvalues are strictly less than one in magnitude. Moreover, if we denote the right-eigenvector that is associated with the eigenvalue at one by $c$ and normalize its entries to add up to one, then it holds that:

$$Ac = c, \quad \mathbb{1}_N^T c = 1, \quad \text{and} \quad 0 < c_k < 1 \ \text{ for all } k. \tag{6}$$

Let us assume for the time being that the agents in the network have agreed on converging towards one of the models (though they do not know beforehand which model it will be). We denote the desired model generically by $w_q^\circ$, where $q \in \{0, 1\}$. In Section IV, we explain how this agreement process can be attained. Here we explain that even when agreement is present, the diffusion strategy (4)-(5) leads to biased estimates unless it is modified in a proper way. To see this, we introduce the following error vectors for any node $k$:

$$\tilde{w}_{k,i} \triangleq w_q^\circ - w_{k,i} \quad \text{and} \quad \tilde{z}_k^\circ \triangleq w_q^\circ - z_k^\circ. \tag{7}$$

Then, using model (1), we obtain that the update vector in (4) becomes

$$h_{k,i} \triangleq u_{k,i}^T\big[d_k(i) - u_{k,i}\, w_{k,i-1}\big] = u_{k,i}^T u_{k,i}\, \tilde{w}_{k,i-1} - u_{k,i}^T u_{k,i}\, \tilde{z}_k^\circ + u_{k,i}^T v_k(i). \tag{8}$$

We collect all error vectors across the network into the block vectors $\tilde{w}_i \triangleq \mathrm{col}\{\tilde{w}_{1,i}, \ldots, \tilde{w}_{N,i}\}$ and $\tilde{z}^\circ \triangleq \mathrm{col}\{\tilde{z}_1^\circ, \ldots, \tilde{z}_N^\circ\}$. We also collect the step-sizes into a block diagonal matrix and introduce the extended combination matrix:

$$\mathcal{M} = \mathrm{diag}\{\mu_1 I_M, \ldots, \mu_N I_M\} \quad \text{and} \quad \mathcal{A} \triangleq A \otimes I_M \tag{9}$$

where $I_M$ denotes the identity matrix of size $M$. In (9), the notation $\mathrm{diag}\{\cdot\}$ constructs a (block) diagonal matrix from its arguments and the symbol $\otimes$ denotes the Kronecker product of two matrices. Moreover, the notation $\mathrm{col}\{\cdot\}$ denotes the vector that is obtained by stacking its arguments on top of each other. Then, starting from (4)-(5) and using relation (8), we can verify that the global error vector $\tilde{w}_i$ of the network evolves over time according to the recursion:

$$\tilde{w}_i = \mathcal{B}_i\, \tilde{w}_{i-1} + y_i \tag{10}$$

where the matrix $\mathcal{B}_i$ and the vector $y_i$ are defined in Table I in terms of the quantities $\{\mathcal{M}, \mathcal{A}\}$ and the data. Note that $\mathcal{B}_i$ is a random matrix due to the randomness of the regressors $\{\boldsymbol{u}_{k,i}\}$. Since the regressors are temporally white and spatially independent, $\mathcal{B}_i$ is independent of $\tilde{\boldsymbol{w}}_{i-1}$. In addition, since $\boldsymbol{u}_{k,i}$ is independent of $\boldsymbol{v}_k(i)$, the noise contribution to $y_i$ has zero mean. Then, from (10), the mean of $\tilde{\boldsymbol{w}}_i$ evolves over time according to the recursion:

$$\mathbb{E}\tilde{w}_i = \mathcal{B} \cdot \mathbb{E}\tilde{w}_{i-1} + y \tag{11}$$

where $\mathcal{B} = \mathbb{E}\mathcal{B}_i$ and $y = \mathbb{E}y_i$ are defined in Table I. It can be easily verified that a necessary and sufficient condition to ensure the convergence of $\mathbb{E}\tilde{\boldsymbol{w}}_i$ in (11) to zero is

$$\rho(\mathcal{B}) < 1 \quad \text{and} \quad y = 0 \tag{12}$$

where $\rho(\cdot)$ denotes the spectral radius of its argument. It was verified in [13, 28] that a sufficient condition to ensure $\rho(\mathcal{B}) < 1$ is to select the step-sizes such that

$$0 < \mu_k < \frac{2}{\rho(R_{u,k})} \tag{13}$$

for all $k$. This conclusion is independent of the observed models $\{z_k^\circ\}$. However, for the second condition in (12), we note that in general the vector $y$ cannot be zero no matter how the nodes select the combination matrix $A$. When this happens, the weight estimates will be biased. Let us consider the example with three nodes in Fig. 2, where node 1 observes data from one of the models while nodes 2 and 3 observe data from the other model. The matrix $A$ in this case is shown in Fig. 2, with its parameters lying in the open interval $(0, 1)$. We assume that the step-sizes and regression covariance matrices are uniform, i.e., $\mu_k = \mu$ and $R_{u,k} = R_u$ for all $k$. If the desired model of the network is $w_1^\circ$, then the third block of $y$ becomes a scaled multiple of the model difference $w_1^\circ - w_0^\circ$, which can never become zero no matter what the parameters are. More generally, using results on the limiting behavior of the estimation errors from [52], we can characterize the limiting point of the diffusion strategy (4)-(5) as follows.

###### Lemma 1.

For the diffusion strategy (4)-(5) with $\mu_k = \mu$ and $R_{u,k} = R_u$ for all $k$, and for sufficiently small step-sizes, all weight estimators converge to a limit point $w^\circ$ in the mean-square sense, i.e., $\mathbb{E}\|w^\circ - \boldsymbol{w}_{k,i}\|^2$ is asymptotically bounded and of the order of $\mu$, where $w^\circ$ is given by

$$w^\circ = \sum_{k=1}^{N} c_k\, z_k^\circ \tag{14}$$

where the vector $c = \mathrm{col}\{c_1, \ldots, c_N\}$ is defined in (6).

###### Proof.

The result follows from Eq. (25) in [52] after identifying the corresponding limit variable used in [52] with $w^\circ$ in (14). ∎

Thus, when the agents collect data from different models, the estimates resulting from the diffusion strategy (4)-(5) converge to the convex combination of these models given by (14), which is different from any of the individual models because $c_k > 0$ for all $k$. A similar conclusion holds for the case of non-uniform step-sizes $\{\mu_k\}$ and covariance matrices $\{R_{u,k}\}$.
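The limit point (14) is easy to check numerically: compute the Perron right-eigenvector $c$ of a left-stochastic matrix $A$ as in (6) and form the convex combination of the observed models. The matrix and model values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 5, 2

# A hypothetical primitive left-stochastic combination matrix
# (strictly positive entries, columns summing to one).
A = rng.random((N, N)) + 0.1
A /= A.sum(axis=0, keepdims=True)

# Perron right-eigenvector of Eq. (6): Ac = c, entries summing to one.
vals, vecs = np.linalg.eig(A)
c = np.real(vecs[:, np.argmax(np.real(vals))])
c /= c.sum()

# Two models; nodes 0-1 observe w0°, nodes 2-4 observe w1°.
w0 = np.zeros(M)
w1 = np.ones(M)
z = np.array([w0, w0, w1, w1, w1])

# Biased limit point of standard diffusion, Eq. (14): a convex
# combination of the observed models, equal to neither of them.
w_lim = c @ z
```

Since every $c_k$ is strictly positive, `w_lim` lies strictly between the two models, illustrating why unmodified diffusion is biased under multiple models.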

## III Modified Diffusion Strategy

To deal with the problem of bias, we now show how to modify the diffusion strategy (4)-(5). We observe from the example in Fig. 2 that the third entry of the vector $y$ cannot be zero because a neighbor of the corresponding node observes data arising from a model that is different from the desired model. Note from (8) that the bias term arises from the gradient direction used in computing the intermediate estimates in (4). These observations suggest that, to ensure unbiased mean convergence, a node should not combine intermediate estimates from neighbors whose observed model is different from the desired model. For this reason, we shall replace the intermediate estimates from these neighbors by their previous estimates in the combination step (5). Specifically, we shall adjust the diffusion strategy (4)-(5) as follows:

$$\psi_{k,i} = w_{k,i-1} + \mu_k\, u_{k,i}^T\big[d_k(i) - u_{k,i}\, w_{k,i-1}\big] \tag{15}$$

$$w_{k,i} = \sum_{l \in \mathcal{N}_k} \Big(a_{l,k}^{(1)}\, \psi_{l,i} + a_{l,k}^{(2)}\, w_{l,i-1}\Big) \tag{16}$$

where the $\{a_{l,k}^{(1)}\}$ and $\{a_{l,k}^{(2)}\}$ are two sets of nonnegative scalars whose respective combination matrices $A_1$ and $A_2$ satisfy

$$A_1 + A_2 = A \tag{17}$$

with $A$ being the original left-stochastic matrix in (3). Note that step (15) is the same as step (4). However, in the second step (16), nodes aggregate the quantities $\{\psi_{l,i}, w_{l,i-1}\}$ from their neighborhood. With such adjustment, we will verify that by properly selecting $\{A_1, A_2\}$, unbiased mean convergence can be guaranteed. The choice of which entries of $A$ go into $A_1$ or $A_2$ will depend on which of the neighbors of node $k$ are observing data arising from a model that agrees with the desired model for node $k$.

### III-A Construction of Matrices A1 and A2

To construct the matrices $\{A_1, A_2\}$, we associate two vectors with the network, $f$ and $g_i$. Both vectors are of size $N$. The vector $f$ is fixed and its $k$th entry, $f(k)$, is set to $1$ when the observed model for node $k$ is $w_1^\circ$; otherwise, it is set to $0$. On the other hand, the vector $g_i$ evolves with time; its $k$th entry, $g_i(k)$, is set to $1$ when the desired model for node $k$ is $w_1^\circ$; otherwise, it is set equal to $0$. Then, we shall set the entries of $A_1$ and $A_2$ according to the following rules:

$$a_{l,k,i}^{(1)} = \begin{cases} a_{l,k}, & \text{if } l \in \mathcal{N}_k \text{ and } f(l) = g_i(k) \\ 0, & \text{otherwise} \end{cases} \tag{18}$$

$$a_{l,k,i}^{(2)} = \begin{cases} a_{l,k}, & \text{if } l \in \mathcal{N}_k \text{ and } f(l) \neq g_i(k) \\ 0, & \text{otherwise.} \end{cases} \tag{19}$$

That is, nodes that observe data arising from the same model that node $k$ wishes to converge to will be reinforced, and their intermediate estimates $\{\psi_{l,i}\}$ will be used (their combination weights are collected into the matrix $A_1$). On the other hand, nodes that observe data arising from a model other than the objective of node $k$ will be de-emphasized, and their prior estimates $\{w_{l,i-1}\}$ will be used in the combination step (16) (their combination weights are collected into the matrix $A_2$). Note that the scalars in (18)-(19) are now indexed with time due to their dependence on $g_i(k)$.
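A direct way to realize rules (18)-(19) is to scan the entries of $A$ and route each weight into $A_1$ or $A_2$ depending on whether the neighbor's observed model matches the node's desired model. The helper below is a minimal sketch; its name and argument layout are our own.

```python
import numpy as np

def split_combination_matrix(A, f, g):
    """Split the left-stochastic matrix A into A1 and A2 per (18)-(19):
    the weight a_{l,k} goes into A1 when neighbor l's observed model f(l)
    matches node k's desired model g(k), and into A2 otherwise, so that
    A1 + A2 = A as required by (17)."""
    N = A.shape[0]
    A1 = np.zeros_like(A)
    A2 = np.zeros_like(A)
    for k in range(N):
        for l in range(N):
            if A[l, k] > 0:          # l is a neighbor of k (a_{l,k} > 0)
                if f[l] == g[k]:
                    A1[l, k] = A[l, k]
                else:
                    A2[l, k] = A[l, k]
    return A1, A2
```

On a small three-node example where one node observes the other model, only that node's weights migrate to $A_2$, and (17) holds by construction.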

### III-B Mean-Error Analysis

It is important to note that to construct the combination weights from (18)-(19), each node $k$ needs to know the observed models of its neighbors (i.e., $f(l)$ for $l \in \mathcal{N}_k$); it also needs to know how to update its own objective $g_i(k)$ so that the entries of $g_i$ converge to the same value. In the next two sections, we will describe a distributed decision-making procedure by which the nodes are able to achieve agreement on the desired model. We will also develop a classification scheme to estimate the observed models of the neighbors using available data. More importantly, the convergence of the vectors $\{f, g_i\}$ will occur before the convergence of the adaptation process to estimate the agreed-upon model. Therefore, let us assume for the time being that the nodes know the observed models of their neighbors and have achieved agreement on the desired model, which we denote by $q$, so that (see Eq. (24) in Theorem 2)

$$g_i(1) = g_i(2) = \cdots = g_i(N) = q, \quad \text{for all } i. \tag{20}$$

Using relation (8) and the modified diffusion strategy (15)-(16), the recursion for the global error vector $\tilde{w}_i$ is again given by (10), with the matrix $\mathcal{B}_i$ and the vector $y_i$ defined in Table I and the extended combination matrices $\mathcal{A}_1$ and $\mathcal{A}_2$ defined in a manner similar to $\mathcal{A}$ in (9). We therefore get the same mean recursion as (11), with the matrix $\mathcal{B}$ and the vector $y$ defined in Table I. The following result establishes asymptotic mean convergence for the modified diffusion strategy (15)-(16).

###### Theorem 1.

Under condition (20), the modified diffusion strategy (15)-(16) converges in the mean if the matrices $A_1$ and $A_2$ are constructed according to (18)-(19) and the step-sizes satisfy condition (13) for those nodes whose observed model is the same as the desired model for the network.

###### Proof.

See Appendix A. ∎

We conclude from the argument in Appendix A that the net effect of the construction (18)-(19) is the following. Let $w_q^\circ$ denote the desired model that the network wishes to converge to, and consider the subset of nodes that receive data arising from this same model; the remaining nodes form the complementary subset. Nodes that belong to the first subset run the traditional diffusion strategy (4)-(5) using the combination matrix $A_1$, and their step-sizes are required to satisfy (13). The remaining nodes set their step-sizes to zero and run only step (5) of the diffusion strategy. These nodes do not perform the adaptive update (4) and therefore their intermediate estimates satisfy $\psi_{k,i} = w_{k,i-1}$ for all $i$.
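The net effect described above can be checked numerically. The sketch below assumes a ring topology, an already-agreed desired model $q = 1$, and a fixed assignment of observed models (all our own illustrative choices); it runs (15)-(16) with the split (18)-(19) and verifies that every agent approaches $w_1^\circ$ rather than a convex mixture of the two models.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, mu, sigma_v = 10, 2, 0.01, 0.1
w_models = [np.zeros(M), np.ones(M)]      # assumed values for w0°, w1°
f = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])   # observed-model indices
q = 1                                     # agreed-upon desired model
z = np.array([w_models[fk] for fk in f])  # z_k° per node

# Ring topology with self-loops; uniform left-stochastic A.
adj = np.eye(N, dtype=bool)
for k in range(N):
    adj[k, (k - 1) % N] = adj[k, (k + 1) % N] = True
A = adj / adj.sum(axis=0, keepdims=True)

# Split A per (18)-(19) under the common desired model g(k) = q:
# row l of A stays in A1 iff f(l) == q.
A1 = A * (f[:, None] == q)
A2 = A - A1

w = np.zeros((N, M))
for i in range(4000):
    psi = np.empty_like(w)
    for k in range(N):
        # Step (15): adapt with the node's own (model-dependent) data.
        u = rng.standard_normal(M)
        d = u @ z[k] + sigma_v * rng.standard_normal()
        psi[k] = w[k] + mu * u * (d - u @ w[k])
    # Step (16): matched neighbors contribute psi, others their old w.
    w = A1.T @ psi + A2.T @ w

err = np.linalg.norm(w - w_models[q], axis=1).max()
```

Nodes observing the other model never have their adaptive updates used (not even their own), so the bias mechanism of (14) is removed and all estimates track $w_1^\circ$.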

## IV Distributed Decision-Making

The decision-making process is motivated by the process used by animal groups to reach agreement, which is known as quorum response [6, 7, 4]. The procedure is illustrated in Fig. 3 and described as follows. At time $i$, every node $k$ has its previous desired model value $\boldsymbol{g}_{i-1}(k)$, now modeled as a random variable since it will be constructed from data realizations that are subject to randomness. Node $k$ exchanges this value with its neighbors and constructs the set

$$\mathcal{N}_{k,i-1}^{g} = \big\{\, l \mid l \in \mathcal{N}_k,\ g_{i-1}(l) = g_{i-1}(k) \,\big\}. \tag{21}$$

That is, the set $\mathcal{N}_{k,i-1}^g$ contains the subset of nodes that are in the neighborhood of node $k$ and have the same desired model as node $k$ at time $i-1$. This set changes over time. Let $n_k^g(i-1)$ denote the number of nodes in $\mathcal{N}_{k,i-1}^g$. Since at least one node (node $k$ itself) belongs to $\mathcal{N}_{k,i-1}^g$, we have that $n_k^g(i-1) \geq 1$. Then, one way for node $k$ to participate in the quorum response is to update its desired model according to the rule:

$$g_i(k) = \begin{cases} g_{i-1}(k), & \text{with probability } q_{k,i-1} \\ 1 - g_{i-1}(k), & \text{with probability } 1 - q_{k,i-1} \end{cases} \tag{22}$$

where the probability $q_{k,i-1}$ is computed as:

$$q_{k,i-1} = \frac{\big[n_k^g(i-1)\big]^K}{\big[n_k^g(i-1)\big]^K + \big[n_k - n_k^g(i-1)\big]^K} > 0 \tag{23}$$

and the exponent $K$ is a positive constant. That is, node $k$ determines its desired model in a probabilistic manner, and the probability that node $k$ maintains its current desired target grows with the $K$th power of the number of neighbors having the same desired model (see Fig. 3(b)). Using the above stochastic formulation, we are able to establish agreement on the desired model among the nodes.

###### Theorem 2.

For a connected network starting from an arbitrary initial selection for the desired models vector $g_0$, and applying the update rule (21)-(23), all nodes eventually achieve agreement on some desired model, i.e.,

$$g_i(1) = g_i(2) = \cdots = g_i(N), \quad \text{as } i \to \infty. \tag{24}$$
###### Proof.

See Appendix B. ∎
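The quorum-response rule (21)-(23) can be sketched as a one-round update; the function name, the small fully connected example, and the value $K = 2$ are illustrative assumptions.

```python
import numpy as np

def quorum_update(g, neighbors, K=2, rng=None):
    """One round of the probabilistic desired-model update (21)-(23).

    g: length-N 0/1 array of current desired models g_{i-1}(k);
    neighbors[k]: list of the neighbors of node k (including k itself);
    K: quorum exponent (a positive constant)."""
    if rng is None:
        rng = np.random.default_rng()
    g_new = g.copy()
    for k, Nk in enumerate(neighbors):
        n_k = len(Nk)
        n_same = sum(int(g[l] == g[k]) for l in Nk)   # |N^g_{k,i-1}| >= 1
        q = n_same**K / (n_same**K + (n_k - n_same)**K)
        if rng.random() > q:          # keep current model w.p. q, else flip
            g_new[k] = 1 - g[k]
    return g_new
```

Repeated rounds on a connected network drive the vector $g$ to agreement, as Theorem 2 asserts; once all neighbors agree, (23) gives $q_{k,i-1} = 1$ and the consensus state is absorbing.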

Although rule (21)-(23) ensures agreement on the decision vector, this construction is still not a distributed solution for one subtle reason: nodes need to agree on which index (0 or 1) to use to refer to either model $\{w_0^\circ, w_1^\circ\}$. This task would in principle require the nodes to share some global information. We circumvent this difficulty and develop a distributed solution as follows. Moving forward, we now associate with each node $k$ two local vectors $\{f_k, g_{k,i}\}$; these vectors will play the role of local estimates for the network vectors $\{f, g_i\}$. Each node will then assign the index value of one to its own observed model, i.e., each node $k$ sets $f_k(k) = 1$. Then, for every $l \in \mathcal{N}_k$, the entry $f_k(l)$ is set to one if node $l$ has the same observed model as node $k$; otherwise, $f_k(l)$ is set to zero. The question remains about how node $k$ knows whether its neighbors have the same observed model as its own (this is discussed in the next section). Here we comment first on how node $k$ adjusts the entries of its vector $g_{k,i}$. Indeed, node $k$ knows its own desired model value $g_{k,i-1}(k)$ from time $i-1$. To assign the remaining neighborhood entries in the vector $g_{k,i-1}$, the nodes in the neighborhood of node $k$ first exchange their desired model indices with node $k$, that is, they send the information $g_{l,i-1}(l)$ to node $k$. However, since $g_{l,i-1}(l)$ from node $l$ is set relative to its own $f_l(l)$, node $k$ needs to set $g_{k,i-1}(l)$ based on the value of $f_k(l)$. Specifically, node $k$ will set $g_{k,i-1}(l)$ according to the rule:

$$g_{k,i-1}(l) = \begin{cases} g_{l,i-1}(l), & \text{if } f_k(l) = f_k(k) \\ 1 - g_{l,i-1}(l), & \text{otherwise.} \end{cases} \tag{25}$$

That is, if node $l$ has the same observed model as node $k$, then node $k$ simply assigns the value of $g_{l,i-1}(l)$ to $g_{k,i-1}(l)$; otherwise, it flips the received index.

In this way, computations that depend on the network vectors $\{f, g_i\}$ will be replaced by computations using the local vectors $\{f_k, g_{k,i}\}$. That is, the quantities appearing in (18)-(19) and (21)-(23) are now replaced by their local counterparts. We verify in the following that using the network vectors is equivalent to using the local vectors.

###### Lemma 2.

It holds that

$$f(l) \oplus g_i(k) = f_k(l) \oplus g_{k,i}(k) \tag{26}$$

$$g_i(l) \oplus g_i(k) = g_{k,i}(l) \oplus g_{k,i}(k) \tag{27}$$

where the symbol $\oplus$ denotes the exclusive-OR operation.

###### Proof.

Since the values of $\{f_k(l), g_{k,i}(l)\}$ are set relative to node $k$'s own entries, it holds that

$$f(k) \oplus f(l) = f_k(k) \oplus f_k(l) \tag{28}$$

$$f(k) \oplus g_i(k) = f_k(k) \oplus g_{k,i}(k) \tag{29}$$

$$f(k) \oplus g_i(l) = f_k(k) \oplus g_{k,i}(l) \tag{30}$$

Then relations (26) and (27) hold in view of the fact:

$$(a \oplus b) \oplus (a \oplus e) = b \oplus e \tag{31}$$

for any binary values $a$, $b$, and $e$. ∎

With these replacements, node $k$ still needs to set the entries of $f_k$ that correspond to its neighbors, i.e., it needs to differentiate between their underlying models and to determine whether their data arise from the same model as node $k$ or not. We propose next a procedure to determine $f_k(l)$ at node $k$ using the estimates $\{\psi_{l,i}, w_{l,i-1}\}$ that are available from its neighbors.

## V Model Classification Scheme

To determine the vector $f_k$, we introduce the belief vector $b_{k,i}$, whose $l$th entry, $b_{k,i}(l)$, will be a measure of the belief by node $k$ that node $l$ has the same observed model. The value of $b_{k,i}(l)$ lies in the range $[0, 1]$. The higher the value of $b_{k,i}(l)$ is, the more confidence node $k$ has that node $l$ is subject to the same model as its own. In the proposed construction, the vector $b_{k,i}$ will be changing over time, and node $k$ will be adjusting $b_{k,i}(l)$ according to the rule:

$$b_{k,i}(l) = \begin{cases} \alpha\, b_{k,i-1}(l) + (1 - \alpha), & \text{to increase belief} \\ \alpha\, b_{k,i-1}(l), & \text{to decrease belief} \end{cases} \tag{32}$$

for some positive scalar $\alpha < 1$. That is, node $k$ increases the belief by combining in a convex manner the previous belief with the value one. Node $k$ then estimates $f_k(l)$ according to the rule:

$$\hat{f}_{k,i}(l) = \begin{cases} 1, & \text{if } b_{k,i}(l) \geq 0.5 \\ 0, & \text{otherwise} \end{cases} \tag{33}$$

where $\hat{f}_{k,i}(l)$ denotes the estimate for $f_k(l)$ at time $i$; it is now a random variable since it is computed from data realizations. Note that the value of $\hat{f}_{k,i}(l)$ may change over time as the belief $b_{k,i}(l)$ evolves.

Since all nodes have similar processing abilities, it is reasonable to consider the following scenario.

###### Assumption 2 (Homogeneous agents).

All nodes in the network use the same step-size, $\mu_k = \mu$, and they observe data arising from the same covariance distribution so that $R_{u,k} = R_u$ for all $k$.

Agents still need to know whether to increase or decrease the belief in (32). We now suggest a procedure that allows the nodes to estimate the vectors $\{f_k\}$ by focusing on their behavior in the far-field regime, when their weight estimates are usually far from their observed models (see (37) for a more specific description). The far-field regime generally occurs during the initial stages of adaptation and, therefore, the vectors $\{f_k\}$ can be determined quickly during these initial iterations.

To begin with, we refer to the update vector $h_{l,i}$ from (8), which can be written as follows for node $l$:

$$h_{l,i} = \mu^{-1}(\psi_{l,i} - w_{l,i-1}) = u_{l,i}^T u_{l,i}\,(z_l^\circ - w_{l,i-1}) + u_{l,i}^T v_l(i). \tag{34}$$

Taking expectations of both sides conditioned on the previous estimate $w_{l,i-1}$, we have that

$$\bar{h}_{l,i} \triangleq \mathbb{E}\big[\boldsymbol{h}_{l,i} \mid \boldsymbol{w}_{l,i-1} = w_{l,i-1}\big] = R_u\,(z_l^\circ - w_{l,i-1}). \tag{35}$$

That is, the expected update direction given the previous estimate, $w_{l,i-1}$, is a scaled vector pointing from $w_{l,i-1}$ towards $z_l^\circ$ with scaling matrix $R_u$. Note that since $R_u$ is positive-definite, the term $\bar{h}_{l,i}$ lies in the same half plane as the vector $(z_l^\circ - w_{l,i-1})$, i.e., $\bar{h}_{l,i}^T(z_l^\circ - w_{l,i-1}) > 0$. Therefore, the update vector provides useful information about the observed model at node $l$. For example, this term tells us how close the estimate at node $l$ is to its observed model. When the magnitude of $\bar{h}_{l,i}$ is large, or the estimate at node $l$ is far from its observed model $z_l^\circ$, we say that node $l$ is in a far-field regime. On the other hand, when the magnitude of $\bar{h}_{l,i}$ is small, then the estimate is close to $z_l^\circ$ and we say that the node is operating in a near-field regime. The vector $\bar{h}_{l,i}$ can be estimated by the first-order recursion:

$$\hat{h}_{l,i} = (1 - \nu)\, \hat{h}_{l,i-1} + \nu\, \mu^{-1}(\psi_{l,i} - w_{l,i-1}) \tag{36}$$

where $\hat{h}_{l,i}$ denotes the estimate for $\bar{h}_{l,i}$ and $\nu$ is a positive step-size. Note that since the value of $\bar{h}_{l,i}$ varies with $w_{l,i-1}$, which is updated using the step-size $\mu$, the value of $\nu$ should be set large enough compared to $\mu$ so that recursion (36) can track variations in $\bar{h}_{l,i}$ over time. Moreover, since node $k$ has access to the quantities $\{\psi_{l,i}, w_{l,i-1}\}$ if node $l$ is in its neighborhood, node $k$ can compute $\hat{h}_{l,i}$ on its own using (36). In the following, we describe how node $k$ updates the belief $b_{k,i}(l)$ using the estimates $\{\hat{h}_{k,i}, \hat{h}_{l,i}\}$.

During the initial stage of adaptation, nodes $k$ and $l$ are generally away from their respective observed models and both nodes are therefore in the far-field. This state is characterized by the conditions

$$\|\hat{h}_{k,i}\| > \eta \quad \text{and} \quad \|\hat{h}_{l,i}\| > \eta \tag{37}$$

for some threshold $\eta > 0$. If both nodes have the same observed model, then the estimates $\hat{h}_{k,i}$ and $\hat{h}_{l,i}$ are expected to point in similar directions towards the observed model (see Fig. 4(a)). Node $k$ will increase the belief $b_{k,i}(l)$ using (32) if

$$\hat{h}_{k,i}^T\, \hat{h}_{l,i} > 0. \tag{38}$$

Otherwise, node $k$ will decrease the belief $b_{k,i}(l)$. That is, when both nodes are in the far-field, node $k$ increases its belief that node $l$ shares the same observed model when the vectors $\hat{h}_{k,i}$ and $\hat{h}_{l,i}$ lie in the same half plane. Note that it is possible for node $k$ to increase $b_{k,i}(l)$ even when nodes $k$ and $l$ have distinct models. This is because it is difficult to differentiate between the models during the initial stages of adaptation. This situation is handled by the evolving network dynamics as follows. If node $k$ considers that the data from node $l$ originate from the same model, then node $k$ will use the intermediate estimate from node $l$ in (16). Eventually, from Lemma 1, the estimates at these nodes get close to a convex combination of the underlying models, which would then enable node $k$ to distinguish between the two models and to decrease the value of $b_{k,i}(l)$. Clearly, for proper resolution, the distance between the models needs to be large enough so that the agents can resolve them. When the models are very close to each other so that resolution is difficult, the estimates at the agents will converge towards a convex combination of the models (which will also be close to the models themselves). Therefore, the belief is updated according to the following rule:

$$b_{k,i}(l) = \begin{cases} \alpha\, b_{k,i-1}(l) + (1 - \alpha), & \text{if } E_1 \\ \alpha\, b_{k,i-1}(l), & \text{if } E_1^c \end{cases} \tag{39}$$

where $E_1$ and $E_1^c$ are the two events described by:

$$E_1: \ \|\hat{h}_{k,i}\| > \eta,\ \|\hat{h}_{l,i}\| > \eta,\ \text{and } \hat{h}_{k,i}^T \hat{h}_{l,i} > 0 \tag{40}$$

$$E_1^c: \ \|\hat{h}_{k,i}\| > \eta,\ \|\hat{h}_{l,i}\| > \eta,\ \text{and } \hat{h}_{k,i}^T \hat{h}_{l,i} \leq 0. \tag{41}$$

Note that node $k$ updates the belief $b_{k,i}(l)$ only when both nodes $k$ and $l$ are in the far-field.
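The belief update (39)-(41) amounts to a small gating-and-smoothing rule. The helper below is a minimal sketch; the default threshold and forgetting factor are illustrative assumptions.

```python
import numpy as np

def update_belief(b, h_k, h_l, eta=0.1, alpha=0.9):
    """Belief update (39)-(41): the belief changes only when both
    update-direction estimates are in the far-field (norms above eta);
    it is then raised under event E1 (positive inner product, i.e.,
    evidence of a shared observed model) and decayed under E1^c."""
    if np.linalg.norm(h_k) > eta and np.linalg.norm(h_l) > eta:
        if h_k @ h_l > 0:             # event E1
            return alpha * b + (1 - alpha)
        return alpha * b              # event E1^c
    return b                          # near-field: belief unchanged
```

Aligned far-field directions push the belief towards one, opposed directions push it towards zero, and near-field estimates leave it untouched, exactly the three branches described above.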

## VI Diffusion Strategy with Decision-Making

Combining the modified diffusion strategy (15)-(16), the combination weights (18)-(19), the decision-making process (21)-(23), and the classification scheme (33) and (39) with $f_k(l)$ replaced by its estimate $\hat{f}_{k,i}(l)$, we arrive at the listing shown in the table. It is seen from the algorithm that the adaptation and combination steps of diffusion, which correspond to steps 1) and 8), are now separated by several intermediate steps. The purpose of these intermediate steps is to select the combination weights properly to carry out the aggregation required by step 8). Note that, to implement the algorithm, nodes need to exchange the quantities $\{\psi_{l,i}, w_{l,i-1}\}$ and their desired-model indices with their neighbors. We summarize the computational complexity and the amount of scalar exchanges of the conventional and modified diffusion strategies in Table II. Note that the modified strategy still requires computations of the same order per iteration as conventional diffusion. Nevertheless, it requires additional additions and multiplications because of the need to compute the update-direction estimates $\{\hat{h}_{l,i}\}$ in step 2). If the nodes can afford to exchange extra information, then instead of every node connected to node $l$ computing the term $\hat{h}_{l,i}$ in step 2), this term can be computed locally by node $l$ and shared with its neighbors. This reveals a useful trade-off between complexity and information exchange.

Due to the dependency among the steps of the algorithm, the analysis of its behavior becomes challenging. However, by examining the various steps, some useful observations stand out. Specifically, it is observed that the convergence of the algorithm occurs in three phases as follows (see also Sec. VIII):

1. Convergence of the classification scheme: The first phase of convergence happens during the initial stages of adaptation. It is natural to expect that during this stage, all weight estimates are generally away from their respective models and the nodes operate in the far-field regime. Then, the nodes use steps 2)-5) to determine the observed models of their neighbors. We explain later in Eq. (77) in Theorem 3 that this construction is able to identify the observed models with high probability. In other words, the classification scheme is able to converge reasonably well and fast during the initial stages of adaptation.

2. Convergence of the decision-making process: The second phase of convergence happens right after the convergence of the classification scheme, once the estimates $\{\hat{f}_{k,i}(l)\}$ have converged. Because the nodes now have correct information about their neighbors' observed models, they use steps 5)-6) to determine their own desired models $\{g_{k,i}(k)\}$. The convergence of this step is ensured by Eq. (24) in Theorem 2.

3. Convergence of the diffusion strategy: After the classification and decision-making processes converge, the estimates $\{\hat{f}_{k,i}(l), g_{k,i}(k)\}$ remain largely invariant, and the combination weights in step 7) therefore remain fixed for all practical purposes. Then, the diffusion strategy becomes unbiased and converges in the mean according to Theorem 1. Moreover, when the estimates are close to steady-state, those nodes whose observed models are the same as the desired model enter the near-field regime and they stop updating their belief vectors (this will be justified by the forthcoming result (75)).

## VII Performance of Classification Procedure

It is clear that the success of the diffusion strategy and decision-making process depends on the reliability of the classification scheme in (33) and (39). In this section, we examine the probability of error for the classification scheme under some simplifying conditions to facilitate the analysis. This is a challenging task to pursue due to the stochastic nature of the classification and decision-making process, and due to the coupling among the agents. Our purpose in this section is to gain some insights into this process through a first-order approximate analysis.

Now, there are two types of error. When nodes $k$ and $l$ are subject to the same observed model (i.e., $z_k^\circ = z_l^\circ$ and $f_k(l) = 1$), then one probability of error is defined as:

$$P_{e,1} = \Pr\big(\hat{f}_{k,i}(l) = 0 \mid f_k(l) = 1\big) = \Pr\big(b_{k,i}(l) < 0.5 \mid z_k^\circ = z_l^\circ\big) \tag{42}$$

where we used rule (33). The second type of probability of error occurs when both nodes have different observed models (i.e., when and ) and refers to the case:

$$P_{e,0} = \Pr\big(\hat{f}_{k,i}(l) = 1 \mid f_k(l) = 0\big) = \Pr\big(b_{k,i}(l) > 0.5 \mid z_k^\circ \neq z_l^\circ\big). \tag{43}$$

To evaluate the error probabilities in (42)-(43), we examine the probability distribution of the belief variable $b_{k,i}(l)$. Note from (39) that the belief variable can be expressed as:

$$b_{k,i}(l) = \alpha\, b_{k,i-1}(l) + (1 - \alpha)\, \xi_{k,i}(l) \tag{44}$$

where $\xi_{k,i}(l)$ is a Bernoulli random variable with

$$\xi_{k,i}(l) = \begin{cases} 1, & \text{with probability } p \\ 0, & \text{with probability } 1 - p. \end{cases} \tag{45}$$

The value of $p$ depends on whether the nodes have the same observed models or not. When $z_k^\circ = z_l^\circ$, the belief is supposed to be increased, and the probability of detection, $P_d$, characterizes the probability that $b_{k,i}(l)$ is increased, i.e.,

$$P_d = \Pr\big(\xi_{k,i}(l) = 1 \mid z_k^\circ = z_l^\circ\big). \tag{46}$$

In this case, the probability $p$ in (45) is replaced by $P_d$. On the other hand, when $z_k^\circ \neq z_l^\circ$, the probability of false alarm, $P_f$, characterizes the probability that the belief is increased when it is supposed to be decreased, i.e.,

$$P_f = \Pr\big(\xi_{k,i}(l) = 1 \mid z_k^\circ \neq z_l^\circ\big) \tag{47}$$

and we replace $p$ in (45) by $P_f$. We will show later (see Lemma 4) how to evaluate the two probabilities $P_d$ and $P_f$.
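Before pursuing the analysis, the connection between (44)-(45) and the error probability (42) can be explored by Monte Carlo simulation. The values of $\alpha$, the stand-in detection probability, and the horizon below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.9            # belief forgetting factor (assumption)
p = 0.9                # stands in for the detection probability P_d
T, trials = 200, 2000  # iterations and Monte Carlo runs

# Simulate the belief recursion (44)-(45): an AR(1) process driven by
# Bernoulli(p) increments, starting from an uninformative belief of 0.5.
b = 0.5 * np.ones(trials)
for _ in range(T):
    xi = (rng.random(trials) < p).astype(float)
    b = alpha * b + (1 - alpha) * xi

# Empirical estimate of Pe,1 = Pr(b < 0.5 | same model), Eq. (42).
Pe1_hat = np.mean(b < 0.5)
```

With a detection probability well above one half, the stationary belief concentrates near $p$, so misclassification events $b_{k,i}(l) < 0.5$ become rare, which is the regime the subsequent analysis quantifies.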