Distributed Clustering and Learning Over Networks
Distributed processing over networks relies on in-network processing and cooperation among neighboring agents. Cooperation is beneficial when agents share a common objective. However, in many applications agents may belong to different clusters that pursue different objectives. Then, indiscriminate cooperation will lead to undesired results. In this work, we propose an adaptive clustering and learning scheme that allows agents to learn which neighbors they should cooperate with and which other neighbors they should ignore. In doing so, the resulting algorithm enables the agents to identify their clusters and to attain improved learning and estimation accuracy over networks. We carry out a detailed mean-square analysis and assess the error probabilities of Types I and II, i.e., false alarm and mis-detection, for the clustering mechanism. Among other results, we establish that these probabilities decay exponentially with the step-sizes so that the probability of correct clustering can be made arbitrarily close to one.
Distributed algorithms for learning, inference, modeling, and optimization by networked agents are prevalent in many domains and applicable to a wide range of problems [2, 3, 4, 5]. Among the various classes of algorithms, techniques that are based on first-order gradient-descent iterations are particularly useful for distributed processing due to their low complexity, low power demands, and robustness against imperfections or unmodeled effects. Three of the most studied classes are consensus algorithms [6, 5, 7, 8, 9], diffusion algorithms [10, 11, 12, 13, 14, 15, 16, 2], and incremental algorithms [17, 18, 19, 20, 21, 22]. The incremental techniques rely on the determination of a Hamiltonian cycle over the topology, which is generally an NP-hard problem and is therefore a hindrance to real-time adaptation, and even more so when the topology is dynamic and changes with time. For this reason, we will consider mainly learning algorithms of the consensus and diffusion types.
In this work we focus on the case in which constant step-sizes are employed in order to enable continuous adaptation and learning in response to streaming data. When diminishing step-sizes are used, the algorithms would cease to adapt after the step-sizes have approached zero, which is problematic for applications that require the network to remain continually vigilant and to track possible drifts in the data and clusters. Therefore, adaptation with constant step-sizes is necessary in these scenarios. It turns out that when constant step-sizes are used, the dynamics of the distributed (consensus or diffusion) strategies are modified in a non-trivial manner: the stochastic gradient noise that is present in their update steps does not die out anymore and it seeps into the operation of the algorithms. In other words, while this noise component would be annihilated by decaying step-sizes, it will remain persistently active during constant step-size adaptation. As such, it becomes important to evaluate how well constant step-size implementations can alleviate the influence of gradient noise. It was shown in [23, 2, 3] that consensus strategies can become problematic when constant step-sizes are employed. This is because of an asymmetry in their update relations, which can cause the state of the network to grow unbounded when these networks are used for adaptation. In comparison, diffusion networks do not suffer from this asymmetry problem and have been shown to be mean stable regardless of the topology of the network. This is a reassuring property, especially in the context of applications where the topology can undergo changes over time. These observations motivate us to focus our analysis on diffusion strategies, although the conclusions and arguments can be extended with proper adjustments to consensus strategies.
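The difference between constant and diminishing step-sizes can be illustrated with a minimal sketch. It assumes a scalar LMS-type stochastic-gradient recursion and a hypothetical model that jumps from 1 to -1 halfway through the run. The function name `lms_track`, the noise levels, and the step-size values are illustrative choices, not quantities taken from this work.

```python
import random

def lms_track(step_size, num_iters=4000, drift_at=2000, seed=0):
    """Scalar stochastic-gradient (LMS-type) recursion tracking a model that jumps once."""
    rng = random.Random(seed)
    w = 0.0
    for i in range(1, num_iters + 1):
        target = 1.0 if i <= drift_at else -1.0       # hypothetical drift in the data
        u = rng.gauss(0.0, 1.0)                       # regressor
        d = u * target + 0.1 * rng.gauss(0.0, 1.0)    # noisy measurement
        mu = step_size(i)                             # step-size used at iteration i
        w += mu * u * (d - u * w)                     # stochastic-gradient update
    return w

w_const = lms_track(lambda i: 0.05)      # constant step-size: keeps adapting
w_decay = lms_track(lambda i: 1.0 / i)   # diminishing step-size: adaptation fades
print(w_const, w_decay)
```

In this sketch the constant step-size estimate ends close to the new model, with a small residual fluctuation caused by the persistent gradient noise. The diminishing step-size estimate has largely stopped adapting by the time the drift occurs and remains far from the new model.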
Now, most existing works on distributed learning algorithms focus on the case in which all agents in the network are interested in estimating a common parameter vector, which generally corresponds to the minimizer of some aggregate cost function (see, e.g., [2, 3, 4, 5] and the references therein). In this article, we are instead interested in scenarios where different clusters of agents within the network are interested in estimating different parameter vectors. There have been several useful works in this domain in the literature under various assumptions, including in the earlier version of this work in . This early investigation dealt only with the case of two separate clusters in the network with each cluster interested in one parameter vector. One useful application of this formulation in the context of biological networks was considered in , where each agent was assumed to collect data arising from one of two models (e.g., the location of two separate food sources). The agents did not know which model generated their observations and, yet, they needed to reach agreement about which model to follow (i.e., which food source to move towards). Another important extension dealing with multiple (more than two) models appears in [25, 26] where multi-task problems are introduced. In this formulation, different clusters of the agents are again interested in estimating different parameter vectors (called “tasks”) and the tasks of adjacent clusters are further assumed to be related to each other so that cooperation among clusters can still be beneficial. This formulation is useful in many scenarios, as already illustrated in , including in multiple target tracking [27, 28] and classification problems involving multiple models [29, 30, 31, 32, 33, 34]. Other useful variations of multi-task problems appear in , which assumes fully-connected networks, and in  where the agents have two types of parameters to estimate (a local parameter and a global parameter). 
These various works focus on mean-square-error (MSE) design, where the parameters of interest are estimated by seeking the minimizer of an MSE cost. Moreover, with the exception of [1, 26], it is generally assumed in these works that the agents know beforehand which clusters they belong to or which parameters they are interested in estimating.
In this article, we extend the approach of  and study multi-tasking adaptive networks under three conditions that are fundamentally different from previous studies. First, we go beyond mean-square-error estimation and allow for more general convex risk functions at the agents. This level of generality allows the framework to handle broader situations both in adaptation and learning, such as logistic regression for pattern classification purposes. Second, we do not assume any relation among the different objectives pursued by the clusters. In other words, we study the important problem where different components of the network are truly interested in different objectives and would like to avoid interference among clusters. And third, the agents do not know beforehand which clusters they belong to and which other agents are interested in the same objective.
For example, in an application involving a sensor network tracking multiple moving objects from various directions, it is reasonable to assume that the trajectories of these objects are independent of each other. In this case, only information shared within clusters is beneficial for learning; the information from agents in other clusters would amount to interference. This means that agents would need to cooperate with neighbors that belong to the same cluster and would need to cut their links to neighbors with different objectives. This task would be simple to achieve if agents were aware of their cluster information. However, we will not be making that assumption. The cluster information will need to be learned as well. This point highlights one major feature of our formulation: we do not assume that agents have full knowledge about their clusters. This assumption is quite common in the context of unsupervised machine learning [29, 33], where the collected measurement data are not labeled and there are multiple candidate models. If two neighboring agents are interested in the same model and they are aware of this fact, then they should exchange data and cooperate. However, the agents may not know this fact, so they cannot be certain about whether or not they should cooperate. Accordingly, in this work, we will devise an adaptive clustering and learning strategy that allows agents to learn which neighbors they should cooperate with. In doing so, the resulting algorithm enables the agents in a network to be correctly clustered and to attain improved learning performance through enhanced intra-cluster cooperation.
Notation: We use lowercase letters to denote vectors, uppercase letters for matrices, plain letters for deterministic variables, and boldface letters for random variables. We also use to denote transposition, for matrix inversion, for the trace of a matrix, and for the 2-norm of a matrix or the Euclidean norm of a vector. Besides, we use for matrices and to denote their Kronecker product, to denote that is positive semi-definite, and to denote that all entries of are nonnegative.
II Problem Formulation
We consider a network consisting of agents inter-connected via some topology. An individual cost function, , of a vector parameter , is associated with every agent . Each cost is assumed to be strictly-convex and is minimized at a unique point . According to the minimizers , agents in the network are categorized into mutually-exclusive clusters, denoted by , .
Definition 1 (Cluster)
Each cluster , denoted by , consists of the collection of agents whose individual costs share the common minimizer , i.e., for all . ∎
Since agents from different clusters do not share common minimizers, the network then aims to solve the clustered multi-task problem:
If the cluster information is available to the agents, then problem (1) can be decomposed into separate optimization problems over the sub-networks associated with the clusters:
for . Assuming the cluster topologies are connected, the corresponding minimizers can be sought by employing diffusion strategies over each cluster. In this case, collaborative learning will only occur within each cluster without any interaction across clusters. This means that for every agent that belongs to a particular cluster , i.e., , its neighbors, which belong to the set denoted by , will need to be segmented into two sets: one set is denoted by and consists of neighbors that belong to the same cluster , and the other set is denoted by and consists of neighbors that belong to other clusters. It is clear that
We illustrate a two-cluster network with a total of agents in Fig. (a). The agents in the clusters are denoted by blue and red circles, and are inter-connected by the underlying topology, so that agents may have in-cluster neighbors as well as neighbors from other clusters. For example, agent from the blue cluster has the in-cluster sub-neighborhood , which is a subset of its neighborhood . If the cluster information is available to all agents, then the network can be split into two sub-networks, one for each cluster, as illustrated in Figs. (b) and (c).
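When the cluster labels are known, the segmentation of a neighborhood into in-cluster and out-of-cluster neighbors is a simple set partition. The sketch below assumes that each neighborhood includes the agent itself, as is customary for diffusion strategies. The toy topology and the names `neighbors`, `cluster_of`, and `split_neighborhood` are hypothetical.

```python
def split_neighborhood(k, neighbors, cluster_of):
    """Split the neighborhood of agent k into in-cluster and out-of-cluster neighbors."""
    same = {l for l in neighbors[k] if cluster_of[l] == cluster_of[k]}
    other = neighbors[k] - same
    return same, other

# Hypothetical toy network: agents 1-3 are "blue", agents 4-5 are "red".
neighbors = {1: {1, 2, 3, 4}, 2: {1, 2}, 3: {1, 3, 5}, 4: {1, 4, 5}, 5: {3, 4, 5}}
cluster_of = {1: "blue", 2: "blue", 3: "blue", 4: "red", 5: "red"}

in_cluster, out_cluster = split_neighborhood(1, neighbors, cluster_of)
print(in_cluster, out_cluster)
```

The two returned sets are disjoint and their union recovers the full neighborhood of the agent.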
However, in this work we consider the more challenging scenario in which the cluster information is only partially available to the agents beforehand, or even completely unavailable. When the cluster information is completely absent, each agent must first identify neighbors belonging to . When the cluster information is partially known, meaning that some agents from the same cluster already know each other, then these agents can cooperate to identify the other members in their cluster. In order to study these two scenarios in a uniform manner, we introduce the concept of a group.
Definition 2 (Group)
A group , denoted by , is a collection of connected agents from the same cluster and knowing that they belong to this same cluster. ∎
Figure (d) illustrates the concept of groups when cluster information is only partially available to the agents in the network from Fig. (a). If an agent has no information about its neighbors, then it falls into a singleton group, such as groups and in Fig. (d). If some neighboring agents know the cluster information of each other, then they form a non-trivial group, such as groups , , and . If every agent in a cluster knows the cluster information of all its neighbors, then all cluster members form one group and this group coincides with the cluster itself, as shown in Fig. (b).
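One way to picture Definition 2 is to treat every pair of neighbors that already know they share a cluster as a "known link"; the groups are then the connected components of the graph formed by these links. The following sketch uses a small union-find structure. The function `find_groups` and the example links are hypothetical and serve only as an illustration.

```python
def find_groups(agents, known_links):
    """Return groups as connected components of the known same-cluster links."""
    parent = {a: a for a in agents}

    def find(a):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in known_links:
        parent[find(a)] = find(b)          # merge the two components

    groups = {}
    for a in agents:
        groups.setdefault(find(a), set()).add(a)
    return sorted(groups.values(), key=min)

# Hypothetical prior knowledge: agents 1-2-3 and 5-6 know they share a cluster.
groups = find_groups(range(1, 7), [(1, 2), (2, 3), (5, 6)])
print(groups)
```

An agent with no known links, such as agent 4 in this example, ends up in a singleton group.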
Since cooperation among neighbors belonging to different clusters can lead to biased results [37, 3, 25], agents should only cooperate within clusters. However, when agents have access to partial cluster information, then they only know their group neighbors but not all cluster neighbors. Therefore, at this stage, agents can only cooperate within groups, leaving behind some potential opportunity for cooperation with neighbors from the same cluster. The purpose of this work is to devise a procedure to enable agents to identify all of their cluster neighbors, such that small groups from the same cluster can merge automatically into larger groups. At the same time, the procedure needs to be able to turn off links between different clusters in order to avoid interference. By using such a procedure, agents in multi-task networks with partial cluster information will be able to cluster themselves in an adaptive manner, and then solve problem (1) by solving (2) collaboratively within each cluster. We shall examine closely the probability of successful clustering and evaluate the steady-state mean-square-error performance for the overall learning process. In particular, we will show that the probability of correct clustering approaches one for sufficiently small step-sizes. We will also show that, with the enhanced cooperation that results from adaptive clustering, the mean-square-error performance for the network will be improved relative to the network without adaptive clustering.
III Models and Assumptions
We summarize the main conditions on the network topology in the following statement.
Assumption 1 (Topology, clusters, and groups)
The network consists of clusters, . The size of cluster is denoted by such that and .
The underlying topology for each cluster is connected. Clusters are also inter-connected by some links so that agents from different clusters may still be neighbors of each other.
There is a total of groups, , in the network. The size of group is denoted by such that and . ∎
It is obvious that because each cluster has at least one group and each group has at least one agent.
Definition 3 (Indexing rule)
Without loss of generality, we index groups according to their cluster indexes such that groups from the same cluster will have consecutive indexes. Likewise, we index agents according to their group indexes such that agents from the same group will have consecutive indexes. ∎
According to this indexing rule, if group belongs to cluster , then the next group will belong either to cluster or the next cluster, ; if agent belongs to group , then the next agent will belong either to group or the next group, .
Based on the problem formulation in Section II, although agents in the same cluster are connected, they are generally not aware of each other’s cluster information, and therefore some agents in the same cluster may not cooperate in the initial stage of adaptation. On the other hand, agents in the same group are aware of each other’s cluster information, so these agents can cooperate. As the learning process proceeds, agents from different groups in the same cluster will recognize each other through information sharing. Once cluster information is inferred, small groups will merge into larger groups, and agents will start cooperating with more neighbors. Through this adaptive clustering procedure, cooperative learning will grow until all agents within the same cluster become cooperative and the network performance is enhanced.
To proceed with the modeling assumptions, we introduce the following network Hessian matrix function:
where the vector collects the parameters from across the network:
We also collect the individual minimizers into a vector:
where the second equality is due to the indexing rule in Definition 3, and denotes an vector with all its entries equal to one. We next list two standard assumptions for stochastic distributed learning over adaptive networks to guide the subsequent analysis in this work. One assumption relates to the analytical properties of the cost functions, and is meant to ensure well-defined minima and well-posed problems. The second assumption relates to stochastic properties of the gradient noise processes that result from approximating the true gradient vectors. This assumption is meant to ensure that the gradient approximations are unbiased and with moments satisfying some regularity conditions. Explanations and motivation for these assumptions in the context of inference problems can be found in [38, 2, 3].
Assumption 2 (Cost functions)
Each individual cost is assumed to be strictly-convex, twice-differentiable, and with bounded Hessian matrix function satisfying:
In each group , at least one individual cost, say, , is strongly-convex, meaning that the lower bound, , on the Hessian of this cost is positive.
The network Hessian function in (4) satisfies the Lipschitz condition:
for any and some . ∎
The second set of assumptions relates to conditions on the gradient noise processes. For this purpose, we introduce the filtration to represent the information flow that is available up to the -th iteration of the learning process. The true network gradient function and its stochastic approximation are respectively denoted by
The gradient noise at iteration and agent is denoted by:
where denotes the estimate for that is available to agent at iteration . The network gradient noise is denoted by and is the random process that is obtained by aggregating all noise processes from across the network into a vector:
Using (11), we can write
We denote the conditional covariance of by
where is in .
Assumption 3 (Gradient noise)
It is assumed that the gradient noise process satisfies the following properties for any in :
It is easy to verify from (16) that the second-order moment of the gradient noise process also satisfies:
IV Proposed Algorithm and Main Results
In order to minimize all cluster cost functions defined by (2), agents need to cooperate only within their clusters. Although cluster information is in general not available beforehand, groups within each cluster are available according to Assumption 1. Therefore, based on this prior information, agents can instead focus on solving the following problem based on partitioning by groups rather than by clusters:
with one parameter vector for each group . In the extreme case when prior clustering information is totally absent, groups will collapse into singletons and problem (20) will reduce to the individual non-cooperative case with each agent running its own stochastic-gradient algorithm to minimize its cost function. In another extreme case when cluster information is completely available, groups will be equivalent to clusters and problem (20) will reduce to the formulation in (1). Therefore, problem (20) is general and includes many scenarios of interest as special cases. We shall argue in the sequel that during the process of solving (20), agents will be able to gradually learn their neighbors’ clustering information. This information will be exploited by a separate learning procedure by each group to dynamically involve more neighbors (from outside the group) in local cooperation. In this way, we will be able to establish analytically that, with high probability, agents will be able to successfully solve problem (1) (and not just (20)) even without having the complete clustering information in advance.
with . For any agent belonging to group in cluster , i.e., , it is easy to verify that
Then, agents in group can seek the solution of in (21) by using the adapt-then-combine (ATC) diffusion learning strategy over , namely,
for all , where denotes the step-size parameter, and are convex combination coefficients that satisfy
Moreover, denotes the random estimate computed by agent at iteration , and is the intermediate iterate. We collect the coefficients into a matrix . Obviously, is a left-stochastic matrix, namely,
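The adapt-then-combine structure can be sketched for a single group as follows. The sketch assumes scalar parameters, quadratic (LMS-type) costs with a common minimizer `w_true`, and uniform combination weights. These choices, along with the names `atc_diffusion`, `neighbors`, and `weights`, are illustrative assumptions. Here `weights[(l, k)]` plays the role of the coefficient that agent `k` assigns to neighbor `l`, so that the weights used by each agent add up to one (left-stochastic combination matrix).

```python
import random

def atc_diffusion(neighbors, weights, mu, w_true, num_iters=2000, seed=1):
    """Adapt-then-combine diffusion over one group with scalar LMS-type costs."""
    rng = random.Random(seed)
    agents = sorted(neighbors)
    w = {k: 0.0 for k in agents}
    for _ in range(num_iters):
        psi = {}
        for k in agents:
            u = rng.gauss(0.0, 1.0)                       # regressor at agent k
            d = u * w_true + 0.1 * rng.gauss(0.0, 1.0)    # noisy measurement
            psi[k] = w[k] + mu * u * (d - u * w[k])       # adaptation step
        # Combination step: convex combination of in-group intermediate iterates.
        w = {k: sum(weights[(l, k)] * psi[l] for l in neighbors[k]) for k in agents}
    return w

# Hypothetical three-agent group on a line graph; each neighborhood includes the agent itself.
neighbors = {1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3}}
weights = {(l, k): 1.0 / len(neighbors[k]) for k in neighbors for l in neighbors[k]}
estimates = atc_diffusion(neighbors, weights, mu=0.02, w_true=2.0)
print(estimates)
```

With a small constant step-size, all three estimates settle near the common minimizer and keep fluctuating slightly because of the gradient noise.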
The procedure used by the agents to enlarge their groups will be based on the following results to be established in later sections. We will show in Theorem 3 that after sufficient iterations, i.e., as , and for small enough step-sizes, i.e., for all , the network estimate defined by (27) exhibits a distribution that is nearly Gaussian:
where denotes a Gaussian distribution with mean and covariance , is from (6),
and is a symmetric, positive semi-definite matrix, independent of , and defined later by (118). In addition, we will show that for any pair of agents from two different groups, for example, and , where the two groups and may or may not originate from the same cluster, the difference between their estimates will also be distributed approximately according to a Gaussian distribution:
is a symmetric, positive semi-definite matrix, and denotes the -th block of with block size . These results are useful for inferring the cluster information for agents and . Indeed, since the covariance matrix in (30) is on the order of , the probability density function (pdf) of will concentrate around its mean, namely, , when is sufficiently small. Therefore, if these agents belong to the same cluster such that , then we will be able to conclude from (30) that with high probability, . On the other hand, if the agents belong to different clusters such that , then it will hold with high probability that . This observation suggests that a hypothesis test can be formulated for agents and to determine whether or not they are members of the same cluster:
where denotes the hypothesis , denotes the hypothesis , and is a predefined threshold. Both agents and will test (32) to reach a symmetric pattern of cooperation. Since and are accessible through local interactions within neighborhoods, the hypothesis test (32) can be carried out in a distributed manner. We will further show that the probabilities for both types of errors incurred by (32), i.e., the false alarm (Type-I) and the mis-detection (Type-II) errors, decay at exponential rates, namely,
for some constants and . Therefore, for long enough iterations and small enough step-sizes, agents are able to successfully infer the cluster information with very high probability.
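A minimal sketch of such a threshold test is given below. It assumes that the test statistic is the squared Euclidean distance between the two estimates and that the same-cluster hypothesis is accepted when this statistic falls below the threshold; the exact form of the statistic in (32) may differ. The function name `same_cluster` and the numerical values are hypothetical.

```python
def same_cluster(w_k, w_l, threshold):
    """Accept the same-cluster hypothesis when the estimates are close (assumed statistic)."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(w_k, w_l))
    return dist_sq < threshold

w_a = [1.00, 2.00]
w_b = [1.01, 1.99]    # estimate of an agent with the same minimizer (small fluctuation)
w_c = [-1.00, 0.50]   # estimate of an agent pursuing a different minimizer

print(same_cluster(w_a, w_b, threshold=0.05))
print(same_cluster(w_a, w_c, threshold=0.05))
```

Because the statistic is symmetric in the two estimates, both agents reach the same decision when they evaluate it on the same pair of estimates, which is consistent with the symmetric pattern of cooperation described above.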
The clustering information acquired at each iteration is used by the agents to dynamically adjust their inferred cluster neighborhoods. The inferred cluster neighborhood for agent at iteration consists of the neighbors that are accepted under hypothesis and the other neighbors that are already in the same group:
Using these dynamically-evolving cluster neighborhoods, we introduce a separate ATC diffusion learning strategy:
where the combination coefficients become random because is random and may vary over iterations. The iteration index is used for these coefficients to enforce causality. Since denotes the neighbors of agent that are already in the same group as , it is obvious that for any . This means that recursion (34a)–(34b) generally involves a larger range of interactions among agents than the first recursion (23a)–(23b). We summarize the algorithm in the following listing.
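As a rough illustration of the clustering step and of the second combination step, the sketch below builds the inferred neighborhood of one agent and then averages the intermediate iterates over it. It assumes a squared-distance test against a threshold and uniform combination weights over the inferred neighborhood. All names (`inferred_neighborhood`, `combine`, `group_of`) and numbers are hypothetical. To respect causality, the estimates passed to `inferred_neighborhood` should be those available from the previous iteration.

```python
def inferred_neighborhood(k, neighbors, group_of, estimates, threshold):
    """Group neighbors are always kept; other neighbors must pass the distance test."""
    accepted = set()
    for l in neighbors[k]:
        if group_of[l] == group_of[k]:
            accepted.add(l)                # known group member (includes k itself)
        else:
            dist_sq = sum((a - b) ** 2 for a, b in zip(estimates[k], estimates[l]))
            if dist_sq < threshold:
                accepted.add(l)            # inferred to share the same cluster
    return accepted

def combine(k, accepted, psi):
    """Uniform convex combination of intermediate iterates over the inferred neighborhood."""
    dim = len(psi[k])
    return [sum(psi[l][j] for l in accepted) / len(accepted) for j in range(dim)]

# Hypothetical snapshot around agent 1.
neighbors = {1: {1, 2, 3, 4}}
group_of = {1: "a", 2: "a", 3: "b", 4: "c"}
estimates = {1: [1.0, 1.0], 2: [1.3, 1.0], 3: [1.02, 0.98], 4: [-1.0, -1.0]}
accepted = inferred_neighborhood(1, neighbors, group_of, estimates, threshold=0.01)

psi = {1: [1.0, 1.0], 2: [1.0, 4.0], 3: [4.0, 1.0], 4: [9.0, 9.0]}
w_new = combine(1, accepted, psi)
print(accepted, w_new)
```

The inferred neighborhood always contains the group neighbors, so the second recursion never cooperates with fewer agents than the first one.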
V Mean-Square-Error Analysis
In the previous section, we mentioned that Theorem 3 in Section VI-A is the key result for the design of the clustering criterion. To arrive at this theorem, we shall derive two useful intermediate results, Lemmas 1 and 2, in this section. These two results are related to the MSE analysis of the first recursion (23a)–(23b), which is used in step (1) of the proposed algorithm. We shall therefore examine the stability and the MSE performance of recursion (23a)–(23b) in the sequel. It is clear that the evolution of this recursion is not influenced by the other two steps. Thus, we can study recursion (23a)–(23b) independently.
V-A Network Error Recursion
We introduce the network error vector:
where is from (6), and the individual error vectors:
where is from (4). Since consists of individual minimizers throughout the network, it follows that . Let
Then, expression (40) can be rewritten as
where each collects the combination coefficients within group :
From the same condition (24), we have that each is itself an left-stochastic matrix:
If group is a subset of cluster , then the agents in share the same minimizer at . Thus, for any , let
We denote the coefficient matrix appearing in (52) by
Then, the network error recursion (52) can be rewritten as
We further introduce the group quantities:
It follows from the indexing rule in Definition 3 that
Due to the block structures in (60)–(65), groups are isolated from each other. Therefore, using these group quantities, the network error recursion (54) is automatically decoupled into a total of group error recursions, where the -th recursion is given by
V-B Mean-Square and Mean-Fourth-Order Error Stability
The stability of the network error recursion (54) is now reduced to studying the stability of the group recursions (67). Recall that, by Definition 2, the agents in each group are connected. Moreover, condition (24) implies that agents in each group have non-trivial self-loops, meaning that for all . It follows that each is a primitive matrix [42, 2] (which is satisfied as long as there exists at least one in each group). Under these conditions, we are now able to ascertain the stability of the second and fourth-order error moments of the network error recursion (54) by appealing to results from .
Theorem 1 (Stability of error moments)
For sufficiently small step-sizes, the network error recursion (54) is mean-square and mean-fourth-order stable in the sense that
It is obvious that the network error recursion (54) is mean-square and mean-fourth-order stable if, and only if, each group error recursion (67) is stable in a similar sense. From Assumption 2, we know that there exists at least one strongly-convex cost in each group. Since the combination matrix for each group is primitive and left-stochastic, we can now call upon Theorems 9.1 and 9.2 from [3, p. 508, p. 522] to conclude that every group error recursion is mean-square and mean-fourth-order stable, namely,
V-C Long-Term Model
Once network stability is established, we can proceed to assess the performance of the adaptive clustering and learning procedure. To do so, it becomes more convenient to first introduce a long-term model for the error dynamics (54). Note that recursion (54) represents a non-linear, time-variant, and stochastic system that is driven by a state-dependent random noise process. Analysis of recursion (54) is facilitated by noting (see Lemma 1 below) that when the step-size parameter is small enough, the mean-square behavior of (54) in steady-state, when , can be well approximated by the behavior of the following long-term model:
where we replaced the random matrix in (54) by the constant matrix
In (73), the matrix is defined by
Note that the long-term model (72) is now a linear time-invariant system, albeit one that continues to be driven by the same random noise process as in (54). Similarly to the original error recursion (54), the long-term recursion (72) can also be decoupled into recursions, one for each group:
Lemma 1 (Accuracy of long-term model)
V-D Low-Dimensional Model
Lemma 1 indicates that we can assess the MSE dynamics of the original network recursion (54) to first-order in by working with the long-term model (72). It turns out that the state variable of the long-term model can be split into two parts, one consisting of the centroids of each group and the other consisting of in-group discrepancies. The details of this splitting are not important for our current discussion but interested readers can refer to Sec. V of  and Eq. (10.37) of [3, p. 558] for a detailed explanation. Here we only use this fact to motivate the introduction of the low-dimensional model. Moreover, it also turns out that the first part, i.e., the part corresponding to the centroids, is the dominant component in the evolution of the error dynamics and that the evolution of the two parts (centroids and in-group discrepancies) is weakly-coupled. By retaining the first part, we can therefore arrive at a low-dimensional model that will allow us to assess performance in closed-form to first-order in . To arrive at the low-dimensional model, we need to exploit the eigen-structure of the combination matrix , or, equivalently, that of each .
Recall that we indicated, prior to the statement of Theorem 1, that each is a primitive and left-stochastic matrix. By the Perron-Frobenius theorem [42, 43, 3], it follows that each has a simple eigenvalue at one with all other eigenvalues lying strictly inside the unit circle. Moreover, if we let denote the right-eigenvector of that is associated with the eigenvalue at one, and normalize its entries to add up to one, then the same theorem ensures that all entries of will be positive:
where denotes the -th entry of . This means that we can express each in the form (see (168) further ahead):
for some eigenvector matrices and , and where denotes the collection of the Jordan blocks with eigenvalues inside the unit circle and with their unit entries on the first lower sub-diagonal replaced by some arbitrarily small constant . The first rank-one component on the RHS of (84) represents the contribution by the largest eigenvalue of , and this component will be used further ahead to describe the centroid of group . The network Perron eigenvector is obtained by stacking the group Perron eigenvectors :
where denotes the -th entry of . According to the indexing rule from Definition 3, it is obvious that .
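The Perron eigenvector of a primitive left-stochastic matrix can be approximated numerically by power iteration, as in the sketch below. The 3x3 matrix is a hypothetical group combination matrix built from uniform weights on a three-agent line graph (with self-loops), and `perron_vector` is an illustrative helper name. Entry `A[l][k]` stands for the weight that agent `k` assigns to agent `l`, so each column adds up to one.

```python
def perron_vector(A, num_iters=500):
    """Power iteration for the right eigenvector of A at eigenvalue one, normalized to sum to one."""
    n = len(A)
    p = [1.0 / n] * n
    for _ in range(num_iters):
        p = [sum(A[r][c] * p[c] for c in range(n)) for r in range(n)]
        s = sum(p)
        p = [x / s for x in p]     # guard against round-off drift in the normalization
    return p

# Hypothetical left-stochastic matrix: neighborhoods {0,1}, {0,1,2}, {1,2}, uniform weights.
A = [[1 / 2, 1 / 3, 0.0],
     [1 / 2, 1 / 3, 1 / 2],
     [0.0,   1 / 3, 1 / 2]]

p = perron_vector(A)
print(p)
```

For this particular uniform-weight example the entries of the Perron vector are proportional to the neighborhood sizes, which gives (2/7, 3/7, 2/7); all entries are positive, in agreement with the Perron-Frobenius theorem.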
Now, for each group , we introduce the low-dimensional (centroid) error recursion defined by (compare with (76)):
where is , and is and defined by