A Communication-Efficient Algorithm for Exponentially Fast Non-Bayesian Learning in Networks
Abstract
We introduce a simple time-triggered protocol to achieve communication-efficient non-Bayesian learning over a network. Specifically, we consider a scenario where a group of agents interact over a graph with the aim of discerning the true state of the world that generates their joint observation profiles. To address this problem, we propose a novel distributed learning rule wherein agents aggregate neighboring beliefs based on a min-protocol, and the inter-communication intervals grow geometrically at a rate $a \geq 1$. Despite such sparse communication, we show that each agent is still able to rule out every false hypothesis exponentially fast with probability $1$, as long as $a$ is finite. For the special case when communication occurs at every time-step, i.e., when $a = 1$, we prove that the asymptotic learning rates resulting from our algorithm are network-structure independent, and a strict improvement upon those existing in the literature. In contrast, when $a > 1$, our analysis reveals that the asymptotic learning rates vary across agents, and exhibit a non-trivial dependence on the network topology coupled with the relative entropies of the agents' likelihood models. This motivates us to consider the problem of allocating signal structures to agents to maximize appropriate performance metrics. In certain special cases, we show that the eccentricity centrality and the decay centrality of the underlying graph help identify optimal allocations; for more general scenarios, we bound the deviation from the optimal allocation as a function of the parameter $a$, and the diameter of the communication graph.
I Introduction
A typical problem in networked systems involves a global task that needs to be accomplished by a group of entities or agents via local computations and information exchanges over the network. These agents, however, are typically endowed with only partial information about the state of the system; as such, inter-agent communication becomes indispensable for achieving the common goal. Given this premise, it is natural to ask: how frequently must the agents communicate to solve the desired problem? Owing to its practical relevance, the question posed above has received significant recent interest from the control systems, information theory, and machine learning communities in the context of a variety of problems, namely average consensus [olshevsky], optimization [opt1, opt2, opt3], and static parameter estimation [sahu]. Our goal in this paper is to extend such investigations to the problem of non-Bayesian learning in a network, also known as the distributed hypothesis testing problem [jad1, jad2, shahin, nedic, lalitha, mitraACC19]. Specifically, the global task in this setting involves learning the true state of the world (among a finite set of hypotheses) that explains the private observations of each agent in the network. Two notable features that are specific to this problem are as follows. First, unlike consensus or distributed optimization, agents are privy to exogenous signals, which, if informative, can enable them to eliminate a subset of the false hypotheses exponentially fast. A related problem where agents receive exogenous signals (measurements) is that of distributed state estimation [martins, mitraTAC], where the global task entails tracking potentially unstable dynamics. Second, the true state of the world remains fixed over time in our setting, considerably simplifying the objective. These attributes play in favor of the problem at hand, motivating us to ask the following questions.
(i) Can we design an algorithm that enables each agent to learn the truth with sparse communication schedules (and in fact, even sparser than typically employed for other classes of distributed problems)? (ii) If so, how fast do the agents learn the truth? (iii) Can we quantify the trade-off(s) between sparsity in communication and the rate of learning? To the best of our knowledge, these questions remain largely unexplored. In this paper, we take a preliminary step towards responding to them via the following contributions.
We develop and analyze a simple time-triggered learning rule that builds on our recent work on distributed hypothesis testing [mitraACC19]. Specifically, the data-aggregation step of our algorithm involves a min-protocol as opposed to the consensus-based averaging schemes intrinsic to existing linear [jad1, jad2] and log-linear [shahin, nedic, lalitha] learning rules. The basic strategy we employ to achieve communication-efficiency is in line with those proposed in [olshevsky, opt1, sahu], where inter-agent communications become progressively sparser as time evolves. In particular, the authors in [olshevsky] and [opt1] explore deterministic rules where the inter-communication intervals grow logarithmically and polynomially in time, respectively. In contrast, the authors in [sahu] propose a rule where at each time-step, an agent communicates with its neighbors in the graph with a probability that decays to zero at a sub-linear rate. In essence, these approaches establish that as long as the inter-communication intervals do not grow too fast, the global task can still be achieved. We depart from these approaches by allowing the inter-communication intervals to grow much faster: at a geometric rate $a$, where the parameter $a \geq 1$ can be adjusted to control the frequency of communication. While more refined approaches to achieve communication-efficiency are conceivable, we show that our simple time-triggered protocol yields strong guarantees. Specifically, we prove that even with an arbitrarily large $a$ (which leads to a highly sparse communication schedule), each agent is still able to learn the truth with probability $1$, provided $a$ is finite. Furthermore, we establish that such learning occurs exponentially fast, and characterize the limiting error exponents as a function of certain parameters of our model, and the constant $a$. In particular, our characterization quantifies the trade-offs between communication-efficiency and the speed of learning for the specific problem under consideration.
Our analysis subsumes the special case when communication occurs at every time-step, i.e., when $a = 1$, which corresponds to the scenario studied in our previous work [mitraACC19]. While the general approach in [mitraACC19] was shown to be robust to worst-case adversarial attack models, a convergence-rate analysis of the same was missing. A significant contribution of this paper is to fill this gap by establishing that when $a = 1$, the asymptotic learning rates resulting from our proposed algorithm are network-structure independent, and a strict improvement over the rates provided by existing algorithms in the literature. In contrast, when $a > 1$, we show that the asymptotic learning rates differ from agent to agent, and depend not only on the relative entropies of the agents' signal models, but also on properties of the underlying network. Given this result, we introduce two new measures of the quality of learning, and study the problem of allocating signal structures to agents to maximize such measures. In certain special cases, we show that the eccentricity centrality and the decay centrality of the communication network play key roles in identifying the optimal allocations. For more general cases, we bound the deviation from the optimal allocation as a function of the parameter $a$, and the diameter of the graph.
II Model and Problem Formulation
Network Model: We consider a setting comprising a group of $n$ agents $\mathcal{V} = \{1, \ldots, n\}$. At certain specific time-steps (to be decided by a time-triggered communication schedule), these agents interact with each other over a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. An edge $(i, j) \in \mathcal{E}$ indicates that agent $i$ can directly transmit information to agent $j$; in such a case, agent $i$ will be called a neighbor of agent $j$. The set of all neighbors of agent $i$ will be denoted $\mathcal{N}_i$. For a strongly-connected graph $\mathcal{G}$, we will use $d(i, j)$ to denote the length of the shortest directed path from agent $i$ to agent $j$, and $D(\mathcal{G})$ to denote the diameter of the graph.^1
^1 A graph is said to be strongly-connected if it has a directed path between every pair of agents; the diameter of such a graph is the length of the longest shortest path between the agents.
Observation Model: Let $\Theta = \{\theta_1, \ldots, \theta_m\}$ denote $m$ possible states of the world, with each state representing a hypothesis. A specific state $\theta^\star \in \Theta$, referred to as the true state of the world, gets realized. Conditional on its realization, at each time-step $t \in \mathbb{N}_{+}$, every agent $i \in \mathcal{V}$ privately observes a signal $s_{i,t} \in \mathcal{S}_i$, where $\mathcal{S}_i$ denotes the signal space of agent $i$.^2 The joint observation profile so generated across the network is denoted $s_t = (s_{1,t}, \ldots, s_{n,t})$, where $s_t \in \mathcal{S}$, and $\mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_n$. Specifically, the signal $s_t$ is generated based on a conditional likelihood function $l(\cdot|\theta^\star)$, the $i$-th marginal of which is denoted $l_i(\cdot|\theta^\star)$, and is available to agent $i$. The signal structure of each agent $i$ is thus characterized by a family of parameterized marginals $\{l_i(\cdot|\theta_p) : \theta_p \in \Theta\}$. We make certain standard assumptions [jad1, jad2, shahin, nedic, lalitha]: (i) The signal space of each agent $i$, namely $\mathcal{S}_i$, is finite. (ii) Each agent $i$ has knowledge of its local likelihood functions $\{l_i(\cdot|\theta_p) : \theta_p \in \Theta\}$, and it holds that $l_i(w|\theta_p) > 0$ for all $w \in \mathcal{S}_i$ and all $\theta_p \in \Theta$. (iii) The observation sequence of each agent is described by an i.i.d. random process over time; however, at any given time-step, the observations of different agents may potentially be correlated. (iv) There exists a fixed true state of the world $\theta^\star \in \Theta$ (that is unknown to the agents) that generates the observations of all the agents. The probability space for our model is denoted $(\Omega, \mathcal{F}, \mathbb{P}^{\theta^\star})$, where $\Omega = \{\omega : \omega = (s_1, s_2, \ldots)\}$, $\mathcal{F}$ is the $\sigma$-algebra generated by the observation profiles, and $\mathbb{P}^{\theta^\star}$ is the probability measure induced by sample paths in $\Omega$. We will use the abbreviation a.s. to indicate almost sure occurrence of an event w.r.t. $\mathbb{P}^{\theta^\star}$.
^2 We use $\mathbb{N}$ and $\mathbb{N}_{+}$ to refer to the set of non-negative integers and positive integers, respectively.
Given the above setting, the goal of each agent in the network is to eventually learn the true state of the world $\theta^\star$. However, the signal structure of any given agent is in general only partially informative, thereby precluding this task from being achieved by any agent in isolation. Specifically, let $\Theta^\star_i \triangleq \{\theta \in \Theta : l_i(\cdot|\theta) = l_i(\cdot|\theta^\star)\}$ represent the set of hypotheses that are observationally equivalent to the true state $\theta^\star$ from the perspective of agent $i$. An agent $i$ is deemed partially informative about the truth if $|\Theta^\star_i| > 1$. Since potentially every agent can be partially informative in the sense described above, inter-agent communications become necessary for each agent to learn the truth.
In this context, our objectives in this paper are to develop an understanding of (i) the amount of leeway that the above problem affords in terms of sparsifying inter-agent communications without compromising the objective of learning the truth, and (ii) the trade-offs between sparse communication and the rate of learning. To this end, we recall the following definition from [mitraACC19] that will prove useful in our subsequent developments.
Definition 1.
(Source agents) An agent $i \in \mathcal{V}$ is said to be a source agent for a pair of distinct hypotheses $(\theta_p, \theta_q) \in \Theta \times \Theta$ if it can distinguish between them, i.e., if $K_i(\theta_p, \theta_q) > 0$, where $K_i(\theta_p, \theta_q)$ represents the KL-divergence [cover] between the distributions $l_i(\cdot|\theta_p)$ and $l_i(\cdot|\theta_q)$. The set of source agents for the pair $(\theta_p, \theta_q)$ is denoted $\mathcal{S}(\theta_p, \theta_q)$.
Throughout the rest of the paper, we will use $K_i(\theta)$ as a shorthand for $K_i(\theta^\star, \theta)$.
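As a concrete illustration of Definition 1, the following Python sketch computes KL-divergences between finite-support marginals and recovers the source set for a pair of hypotheses. The variable names and the toy likelihood models are our own illustrative assumptions, not part of the formal setup.

```python
import math

def kl_divergence(p, q):
    """KL-divergence D(p || q) between two distributions on a finite signal space."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def source_agents(marginals, p, q):
    """Source set for the hypothesis pair (theta_p, theta_q): agents whose
    marginals under the two hypotheses differ (positive KL-divergence).

    marginals[i][theta] is agent i's marginal over its signal space under theta.
    """
    return [i for i, fam in enumerate(marginals)
            if kl_divergence(fam[p], fam[q]) > 0]

# Toy example: two hypotheses, agent 0 informative, agent 1 not.
marginals = [
    {0: [0.8, 0.2], 1: [0.5, 0.5]},  # agent 0 can tell the hypotheses apart
    {0: [0.5, 0.5], 1: [0.5, 0.5]},  # agent 1 sees identical statistics
]
print(source_agents(marginals, 0, 1))  # prints [0]
```

A binary signal space suffices here; the same code applies verbatim to any finite signal space.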
III A Communication-Efficient Learning Rule
In this section, we formally introduce a simple time-triggered belief update rule parameterized by a constant $a \geq 1$ that determines the frequency of communication (to be made more precise below). In order to collaboratively learn the true state of the world, every agent $i$ maintains a local belief vector $\pi_{i,t}$, and an actual belief vector $\mu_{i,t}$, each of which are probability distributions over the hypothesis set $\Theta$. These vectors are initialized with $\pi_{i,0}(\theta) > 0$ and $\mu_{i,0}(\theta) > 0$ for each $\theta \in \Theta$ (but otherwise arbitrarily), and subsequently updated as follows.

Update of the local beliefs: At each time-step $t \in \mathbb{N}$, the local belief vectors are updated based on a standard Bayesian rule:

$$\pi_{i,t+1}(\theta) = \frac{l_i(s_{i,t+1}|\theta)\,\pi_{i,t}(\theta)}{\sum_{p=1}^{m} l_i(s_{i,t+1}|\theta_p)\,\pi_{i,t}(\theta_p)}. \qquad (1)$$
Update of the actual beliefs: Let $\mathcal{I} = \{t_1, t_2, \ldots\}$ denote a sequence of time-steps satisfying $t_{k+1} - t_k = \lceil a^{k} \rceil$, with $t_1 = 1$. If $t+1 \in \mathcal{I}$, then $\mu_{i,t+1}(\theta)$ is updated as

$$\mu_{i,t+1}(\theta) = \frac{\min\left\{\{\mu_{j,t}(\theta)\}_{j \in \mathcal{N}_i}, \pi_{i,t+1}(\theta)\right\}}{\sum_{p=1}^{m} \min\left\{\{\mu_{j,t}(\theta_p)\}_{j \in \mathcal{N}_i}, \pi_{i,t+1}(\theta_p)\right\}}. \qquad (2)$$

If $t+1 \notin \mathcal{I}$, $\mu_{i,t+1}(\theta)$ is simply held constant:

$$\mu_{i,t+1}(\theta) = \mu_{i,t}(\theta). \qquad (3)$$
In words, while the local beliefs are updated at every time-step, the actual beliefs are updated only at time-steps that belong to the set $\mathcal{I}$, i.e., an agent $i$ is allowed to transmit to its out-neighbors, and receive from each in-neighbor in $\mathcal{N}_i$, if and only if $t+1 \in \mathcal{I}$. When $a = 1$, the actual beliefs get updated as per (2) at every time-step, and we recover the rule proposed in [mitraACC19]. When $a > 1$, note that the inter-communication intervals grow exponentially at a rate dictated by the parameter $a$. Our goal in this paper is to precisely characterize the impact of such sparse communication on the asymptotic rate of learning of each agent. Prior to doing so, a few comments are in order.
First, notice that the data-aggregation rule in (2) is based on a min-protocol, as opposed to any form of "belief-averaging" commonly employed in the existing distributed learning literature [jad1, jad2, shahin, nedic, lalitha]. Essentially, while the local belief updates (1) capture what an agent can learn by itself, the actual belief updates (2) incorporate information from the rest of the network. As demonstrated by Corollary 1 in the next section, when $a = 1$, such a min-protocol yields better asymptotic learning rates than all existing schemes. This motivates us to use a belief update rule of the form (2) for studying the case when $a > 1$. Second, we note that the proposed time-triggered protocol is simple, easy to implement, and computationally cheap. At the same time, the exponentially growing intervals afford a much sparser communication schedule relative to related literature. Third, while one can potentially consider extensions of this algorithm that account for asynchronicity, communication failures, delays, etc., we focus on the scheme here in order to (i) concretely isolate the trade-off between sparse communication and the quality of learning as measured by the asymptotic learning rates, and (ii) provide insights into how the network structure impacts such rates. A final comment needs to be made regarding our choice of achieving communication-efficiency by cutting down on communication rounds, as opposed to truncating the number of bits exchanged per communication round, an approach pursued in quantization-based schemes [suresh]. As argued in [opt2], when it comes to transmitting small messages, such as the $m$-dimensional actual belief vectors in our setting, communication latency acts as the bottleneck of overall performance and dominates message-size dependent transmission latency. This justifies our sparse communication scheme. With these points in mind, we proceed to the analysis of the algorithm developed in this section.
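To make the update rules concrete, here is a minimal Python sketch of one plausible implementation of the protocol described above: a Bayesian local update, a min-protocol aggregation step, and a trigger schedule whose gaps grow like $a^k$. All names, and the exact form of the trigger sequence, are our own illustrative assumptions; the schedule used in the analysis may differ in its constants.

```python
import math

def trigger_set(a, horizon):
    """Trigger instants whose consecutive gaps grow like a**k.
    For a == 1 every time-step is a trigger; larger a gives sparser schedules."""
    times, t, k = set(), 1, 0
    while t <= horizon:
        times.add(t)
        t += math.ceil(a ** k)
        k += 1
    return times

def bayes_update(pi, signal, lik):
    """Local Bayesian update: reweight the belief by the signal's likelihood,
    then renormalize over the hypothesis set."""
    post = {th: lik[th][signal] * p for th, p in pi.items()}
    z = sum(post.values())
    return {th: p / z for th, p in post.items()}

def min_aggregate(neighbor_mus, pi_local):
    """Min-protocol aggregation: entrywise min over the neighbors' actual
    beliefs and the agent's own local belief, followed by renormalization."""
    m = {th: min([mu[th] for mu in neighbor_mus] + [pi_local[th]])
         for th in pi_local}
    z = sum(m.values())
    return {th: v / z for th, v in m.items()}
```

An agent would call `bayes_update` at every time-step, and `min_aggregate` only at instants in `trigger_set`, holding its actual belief constant otherwise.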
IV Main Result and Discussion
The main result of the paper is as follows; the proof of this result is presented in Section V.
Theorem 1.
Suppose the communication parameter satisfies $1 \leq a < \infty$, and the following conditions are met.

For every pair of distinct hypotheses $(\theta_p, \theta_q) \in \Theta \times \Theta$, the corresponding source set $\mathcal{S}(\theta_p, \theta_q)$ is non-empty.

The communication graph $\mathcal{G}$ is strongly-connected.

Every agent has a non-zero prior belief on each hypothesis, i.e., $\pi_{i,0}(\theta) > 0$ and $\mu_{i,0}(\theta) > 0$ for all $\theta \in \Theta$, and for all $i \in \mathcal{V}$.
Then, the time-triggered distributed learning rule described by equations (1), (2), (3) provides the following guarantees.

(Consistency): For each agent $i \in \mathcal{V}$, $\mu_{i,t}(\theta^\star) \to 1$ a.s.

(Asymptotic Rate of Rejection of False Hypotheses): For each agent $i \in \mathcal{V}$, and for each false hypothesis $\theta \in \Theta \setminus \{\theta^\star\}$, the following holds:

$$\liminf_{t\to\infty} -\frac{\ln \mu_{i,t}(\theta)}{t} \;\geq\; \max_{v \in \mathcal{S}(\theta^\star, \theta)} \frac{K_v(\theta^\star, \theta)}{a^{\,d(v,i)+1}} \quad \text{a.s.} \qquad (4)$$
We obtain the following important corollary, the proof of which follows readily from that of Theorem 1 in Section V.
Corollary 1.
Suppose communication occurs at every time-step, i.e., suppose $a = 1$. Let conditions (i)–(iii) in the statement of Theorem 1 hold. Then, our proposed learning rule guarantees consistency in the same sense as in Theorem 1. Furthermore, for each agent $i \in \mathcal{V}$, and for each false hypothesis $\theta \in \Theta \setminus \{\theta^\star\}$, the following holds:

$$\liminf_{t\to\infty} -\frac{\ln \mu_{i,t}(\theta)}{t} \;\geq\; \max_{v \in \mathcal{S}(\theta^\star, \theta)} K_v(\theta^\star, \theta) \quad \text{a.s.} \qquad (5)$$
We remark on the implications of the above results.
Implications of Theorem 1: We first note that despite its simplicity, the time-triggered algorithm proposed in Section III provides strong guarantees: Eqn. (4) indicates that although the inter-communication intervals grow exponentially at an arbitrarily large (but finite) rate $a$, each agent is still able to eliminate every false hypothesis at an exponential rate with probability $1$. More interestingly, (4) reveals that the asymptotic learning rates are agent-specific, i.e., different agents may discover the truth at different rates.^3 In particular, when considering the asymptotic rate of rejection of a particular false hypothesis $\theta$ at a given agent $i$, notice from the RHS of (4) that one needs to account for the attenuated relative entropies of the corresponding source agents, where the attenuation factor scales exponentially with the distances of agent $i$ from such source agents. This contrasts with existing literature [jad1, jad2, shahin, lalitha, nedic], and the case when $a = 1$ in Corollary 1, where all agents learn the truth at identical rates.
^3 We use the lower bounds derived in (4), (5) as a proxy when referring to the corresponding asymptotic learning rates.
Implications of Corollary 1: In sharp contrast to the case when $a > 1$, Corollary 1 indicates that when communication occurs at every time-step (i.e., $a = 1$), the asymptotic learning rates are network-structure independent, and identical for each agent. Since this case represents the standard distributed hypothesis testing setup studied in the literature, it becomes important to know how such rates compare with those resulting from existing "belief-averaging" schemes [jad1, jad2, shahin, lalitha, nedic]. To this end, we note that under the same set of assumptions as in Theorem 1, both linear [jad1, jad2] and log-linear [shahin, lalitha, nedic] opinion pooling lead to an asymptotic rate of rejection of the form $\sum_{v \in \mathcal{V}} x_v K_v(\theta^\star, \theta)$ for each false hypothesis $\theta$, and the rate is identical for each agent. Here, $x_v$ represents the eigenvector centrality of agent $v$. It is well known that for a strongly-connected graph, $x_v > 0$ for each $v \in \mathcal{V}$, with $\sum_{v \in \mathcal{V}} x_v = 1$. Thus, based on the above discussion, and referring to (5), we conclude that a significant contribution of the algorithm proposed in this paper is that it yields strictly better asymptotic learning rates than those existing in the literature, for the standard setting when $a = 1$.^4
^4 Recently, in [mitraTAC19], we showed that this result continues to hold even if the underlying graph changes with time, but satisfies a mild joint strong-connectivity condition.
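The comparison above can be sanity-checked numerically. In the sketch below (our own illustrative code, with made-up entropies and centralities), the min-rule rate for $a = 1$ is the best relative entropy in the network, while belief-averaging yields an eigenvector-centrality-weighted average; since the weights are positive and sum to one, the former always dominates.

```python
def averaging_rate(centralities, entropies):
    """Rate under linear/log-linear pooling: centrality-weighted average."""
    return sum(x * k for x, k in zip(centralities, entropies))

def min_rule_rate(entropies):
    """Rate under the min-rule with communication at every step:
    the largest relative entropy among all agents."""
    return max(entropies)

# Illustrative numbers: three agents, one highly informative.
K = [0.3, 0.0, 0.1]   # relative entropies K_v(theta*, theta)
x = [0.2, 0.5, 0.3]   # eigenvector centralities (positive, sum to 1)
print(min_rule_rate(K), averaging_rate(x, K))  # prints 0.3 0.09
```

The gap is largest when the informative agents happen to have small eigenvector centrality, which is exactly the regime where averaging-based rules suffer.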
Trade-Off between Sparse Communication and Quality of Learning: From (4), it is apparent that sparser communication schedules (corresponding to larger values of $a$) come at the cost of lower asymptotic learning rates. Furthermore, since such rates depend upon the network structure when $a > 1$, a poor allocation of signal structures to agents can have adverse effects on the learning rates of certain agents. However, the above problem is readily bypassed when $a = 1$, since the learning rates for that case depend solely on the relative entropies of the agents, as shown by (5).
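The trade-off can be visualized with a short computation. Assuming a rate bound of the attenuated form $K_v / a^{\,d(v,i)+1}$ discussed above (the exact constants in (4) may differ), the sketch below computes each agent's bound on a small directed ring via BFS distances; all inputs are made up for illustration.

```python
from collections import deque

def bfs_distances(adj, src):
    """Shortest directed path lengths from src (adjacency-list graph)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def rate_bound(adj, entropies, a, i):
    """Illustrative bound for agent i: best source-agent entropy, attenuated
    by a**(distance from the source to i, plus one)."""
    best = 0.0
    for v, kv in entropies.items():
        d = bfs_distances(adj, v)
        if kv > 0 and i in d:
            best = max(best, kv / a ** (d[i] + 1))
    return best

ring = {0: [1], 1: [2], 2: [0]}   # directed 3-cycle
K = {0: 0.2, 1: 0.0, 2: 0.0}      # only agent 0 is informative
for a in (1, 2, 4):
    print(a, [round(rate_bound(ring, K, a, i), 4) for i in range(3)])
```

For $a = 1$ all agents share the bound $0.2$; as $a$ grows, agents farther from the informative agent see their bound decay geometrically.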
V Proof of the Main Result
In order to prove Theorem 1, we require a few intermediate results. The first one is a standard consequence of Bayesian updating, and characterizes the behavior of the local belief trajectories generated via (1); for a proof, see [mitraACC19].
Lemma 1.
Consider a false hypothesis $\theta \in \Theta \setminus \{\theta^\star\}$, and an agent $i \in \mathcal{S}(\theta^\star, \theta)$. Suppose $\pi_{i,0}(\theta^\star) > 0$. Then, the update rule (1) ensures that (i) $\pi_{i,t}(\theta) \to 0$ a.s., (ii) $\lim_{t\to\infty} \pi_{i,t}(\theta^\star)$ exists a.s. and satisfies $\lim_{t\to\infty} \pi_{i,t}(\theta^\star) > 0$, and (iii) the following holds:

$$\lim_{t\to\infty} -\frac{\ln \pi_{i,t}(\theta)}{t} = K_i(\theta^\star, \theta) \quad \text{a.s.} \qquad (6)$$
Lemma 2.
Suppose the conditions of Theorem 1 hold. Then, on almost every sample path, there exist $\eta > 0$ and $T \in \mathbb{N}$ such that

$$\pi_{i,t}(\theta^\star) \geq \eta \ \text{ and } \ \mu_{i,t}(\theta^\star) \geq \eta, \quad \forall t \geq T, \ \forall i \in \mathcal{V}. \qquad (7)$$
Proof.
Let $\bar{\Omega}$ denote the set of sample paths for which the assertions in Lemma 1 hold for each false hypothesis $\theta \in \Theta \setminus \{\theta^\star\}$. Based on Lemma 1, we note that $\mathbb{P}^{\theta^\star}(\bar{\Omega}) = 1$. Consequently, to prove the result, it suffices to establish the existence of $\eta > 0$, and $T \in \mathbb{N}$, such that (7) holds for each sample path $\omega \in \bar{\Omega}$. To this end, pick an arbitrary sample path $\omega \in \bar{\Omega}$. We first argue that the local beliefs of every agent on the true state $\theta^\star$ are bounded away from $0$ on $\omega$. To see this, pick any agent $i \in \mathcal{V}$. Suppose there exists some $\theta \in \Theta \setminus \{\theta^\star\}$ for which $i \in \mathcal{S}(\theta^\star, \theta)$. Then, based on our choice of $\omega$, it follows directly from Lemma 1 that $\lim_{t\to\infty} \pi_{i,t}(\theta^\star) > 0$, where the last inequality follows from condition (iii) in Theorem 1. In particular, given the structure of the update rule (1), it follows that $\pi_{i,t}(\theta^\star) > 0$ for all time (since if $\pi_{i,t}(\theta^\star) = 0$ at any instant, then the corresponding belief would remain at $0$ for all subsequent time-steps, thereby violating the fact that $\lim_{t\to\infty} \pi_{i,t}(\theta^\star) > 0$). If there exists no $\theta \in \Theta \setminus \{\theta^\star\}$ for which $i \in \mathcal{S}(\theta^\star, \theta)$, then every hypothesis in $\Theta$ is observationally equivalent to $\theta^\star$ from the point of view of agent $i$. In this case, it is easy to see that based on (1), the local beliefs of agent $i$ remain unchanged over time. In particular, this implies $\pi_{i,t}(\theta^\star) = \pi_{i,0}(\theta^\star) > 0$ for all $t$. This establishes our claim that on $\omega$, the local beliefs of all the agents on $\theta^\star$ remain bounded away from $0$.
To proceed, define $\beta \triangleq \min_{i \in \mathcal{V}} \lim_{t\to\infty} \pi_{i,t}(\theta^\star) > 0$, where the inequality follows from condition (iii) in Theorem 1. Pick a small number $\epsilon \in (0, \beta)$, and notice that our discussion concerning the evolution of the local beliefs readily implies the existence of a time-step $T_1$, such that for all $t \geq T_1$, $\pi_{i,t}(\theta^\star) \geq \beta - \epsilon$ for every $i \in \mathcal{V}$. Now define $\bar{\eta} \triangleq \min_{i \in \mathcal{V}} \min_{0 \leq t \leq T_1} \mu_{i,t}(\theta^\star)$, and observe that $\bar{\eta} > 0$. This observation follows from the fact that given the structure of the update rules (2) and (3), and condition (iii) in Theorem 1, $\bar{\eta}$ can equal $0$ if and only if some agent in the network sets its local belief on $\theta^\star$ to $0$ at some time-step prior to $T_1$. However, this possibility is ruled out in view of the previously established fact that on $\omega$, the local beliefs on $\theta^\star$ remain bounded away from $0$. Let $\eta \triangleq \min\{\beta - \epsilon, \bar{\eta}\}$. It is apparent from the preceding discussion that the local beliefs on $\theta^\star$ are bounded below by $\eta$ on $\omega$. It remains to establish a similar result for the actual beliefs $\mu_{i,t}(\theta^\star)$. To this end, let $t'$ be the first time-step following $T_1$ that belongs to the set $\mathcal{I}$. Based on (3), notice that $\mu_{i,t}(\theta^\star) \geq \bar{\eta} \geq \eta$ for all $t \in [T_1, t')$, and for each $i \in \mathcal{V}$. Based on (2), at time-step $t'$, the actual belief on $\theta^\star$ of an agent $i \in \mathcal{V}$ satisfies:
(8) 
where the last equality follows from the fact that the local belief vectors generated via (1) are valid probability distributions over the hypothesis set at each time-step, and hence sum to $1$. The above argument applies identically to each agent in $\mathcal{V}$. Furthermore, it is easily seen that based on (3), and a similar reasoning as above, identical conclusions can be drawn for each subsequent time-step when the agents update their actual beliefs based on (2). This readily establishes (7), and completes the proof. ∎
Lemma 3.
Suppose the conditions of Theorem 1 hold. Then, for each agent $i \in \mathcal{V}$, each false hypothesis $\theta \in \Theta \setminus \{\theta^\star\}$, and each source agent $v \in \mathcal{S}(\theta^\star, \theta)$, the following holds:

$$\liminf_{t\to\infty} -\frac{\ln \mu_{i,t}(\theta)}{t} \;\geq\; \frac{K_v(\theta^\star, \theta)}{a^{\,d(v,i)+1}} \quad \text{a.s.} \qquad (9)$$
Proof.
Throughout this proof, we use the same notation as in the proof of Lemma 2. With $\bar{\Omega}$ as in Lemma 2, pick an arbitrary sample path $\omega \in \bar{\Omega}$, an agent $i \in \mathcal{V}$, and an agent $v \in \mathcal{S}(\theta^\star, \theta)$. Since condition (ii) in Theorem 1 is met, there exists a directed path of shortest length $d(v, i)$ from agent $v$ to agent $i$ in $\mathcal{G}$. To prove the result, we shall induct on the length of such a path. First, we consider the base case when $d(v, i) = 0$, i.e., when $i = v$. In other words, we will analyze the asymptotic rate of rejection of $\theta$ at the source agent $v$. Fix $\epsilon > 0$, and notice that since $\omega \in \bar{\Omega}$, Lemma 1 implies that there exists a time-step $T_2$, such that:
(10) 
Since $\omega \in \bar{\Omega}$, Lemma 2 guarantees the existence of a time-step $T$, and a constant $\eta > 0$, such that on $\omega$, the beliefs (both local and actual) of every agent on $\theta^\star$ are bounded below by $\eta$ following $T$. Let $T_3 = \max\{T, T_2\}$. For the remainder of the proof, to simplify the notation, we suppress the dependence of various quantities on the parameters $\omega$, $\epsilon$, and $\theta$, since such dependence can be easily inferred from context. Let $t_k$ be the first time-step following $T_3$ that belongs to $\mathcal{I}$, i.e., a time-step when agent $v$ updates its actual beliefs based on (2). Then, based on the preceding discussion and (2), we have:
(11) 
Regarding the inequalities in (11), (a) follows directly from (2), whereas (b) follows from (10) and the fact that $\eta$ lower bounds the beliefs (both local and actual) of all agents on the true state $\theta^\star$. Note that consecutive trigger-points satisfy $t_{k+1} - t_k = \lceil a^{k} \rceil$. Based on (3), we then have:
(12) 
Based on our rule, the next update of the actual beliefs takes place at time-step $t_{k+1}$. Employing the same reasoning as we did to arrive at (11), we obtain:
(13) 
Coupled with the above inequality, (3) once again implies:
(14) 
Generalizing the above reasoning, we obtain:
(15) 
where
(16) 
This immediately leads to the conclusion that, for all sufficiently large $t$:
(17) 
where
(18) 
Taking the natural log on both sides of (17), dividing throughout by $t$, and simplifying, we obtain that, for all sufficiently large $t$:
(19) 
Taking the limit inferior on both sides of the above inequality yields:
(20) 
where the second inequality follows from the fact that $\lim_{k\to\infty} t_k / t_{k+1} = 1/a$, and the final equality results from further simplifications based on (18). Finally, note that $\epsilon$ can be made arbitrarily small in the above inequality, and that the above conclusions hold for a generic sample path $\omega \in \bar{\Omega}$, where $\mathbb{P}^{\theta^\star}(\bar{\Omega}) = 1$. This establishes (9) for the case when $d(v, i) = 0$, and completes the proof of the base case of our induction. To proceed, suppose (9) holds for each node $j$ satisfying $d(v, j) \leq r$, where $r$ is a non-negative integer satisfying $r + 1 \leq D(\mathcal{G})$ (recall that $D(\mathcal{G})$ represents the diameter of the graph $\mathcal{G}$). Let $i$ be such that $d(v, i) = r + 1$. Thus, there must exist some node $j \in \mathcal{N}_i$ such that $d(v, j) = r$. The induction hypothesis applies to this node $j$, and hence, we have:
(21) 
Let $\Omega'$ be the set of sample paths for which the above inequality holds. With $\bar{\Omega}$ defined as before, notice that $\mathbb{P}^{\theta^\star}(\bar{\Omega} \cap \Omega') = 1$, since $\bar{\Omega}$ and $\Omega'$ each have measure $1$. Pick an arbitrary sample path $\omega \in \bar{\Omega} \cap \Omega'$, and notice that based on arguments identical to the base case, on the sample path $\omega$ there exists a time-step $T'$, such that the beliefs of all agents on $\theta^\star$ are bounded below by $\eta$ following $T'$, and
(22) 
where $\epsilon$ is an arbitrarily small number and
(23) 
Proceeding as in the base case, let $t_k$ be the first time-step following $T'$ that belongs to the set $\mathcal{I}$. Noting that $j \in \mathcal{N}_i$, using (2), (22), and similar arguments as those used to arrive at (11), we obtain:
(24) 
where
(25) 
Repeating the above analysis for each subsequent trigger-point, using (3), and following similar arguments as in the base case yields that, for all sufficiently large $t$,
(26) 
where
(27) 
Notice that the inequality in (26) resembles that in (17). Thus, the remaining steps can be completed identically to the base case to yield:
(28) 
The induction step, and in turn the proof, can be completed by substituting the expression for the constant defined in (27) into the above inequality, and recalling that $d(v, i) = r + 1$. ∎
We are now in a position to prove Theorem 1.
Proof.
The rate bound (4) follows by taking the maximum over $v \in \mathcal{S}(\theta^\star, \theta)$ in (9) of Lemma 3. Consistency then follows by noting that, almost surely, $\mu_{i,t}(\theta) \to 0$ for every false hypothesis $\theta \in \Theta \setminus \{\theta^\star\}$, and that $\mu_{i,t}$ is a valid probability distribution over $\Theta$ at each time-step, whence $\mu_{i,t}(\theta^\star) \to 1$ a.s. ∎
VI Simulation Example
Consider a binary hypothesis testing scenario where $\Theta = \{\theta_1, \theta_2\}$, and $\theta_1$ is the true state of the world. The network of agents is depicted in Figure 1. The signal space of every agent is identical. The agent likelihood models are such that only two agents can distinguish between $\theta_1$ and $\theta_2$; every other agent $i$ has $K_i(\theta_1, \theta_2) = 0$. Figure 2 plots the instantaneous rates of rejection of the false hypothesis $\theta_2$ resulting from our proposed algorithm. Based on Figures 1 and 2, a few key observations are: (i) each informative agent dominates the speed of learning of the agents that are close to it in the network, (ii) the rate of rejection of the false hypothesis is indeed agent-specific, and (iii) the simulation results agree very closely with the theoretical lower bounds on the limiting rates of rejection in Theorem 1.
VII The Impact of Information Allocation on Asymptotic Learning Rates
Theorem 1 indicates that the asymptotic learning rates of the agents are shaped by a non-trivial interplay between the relative entropies of their signal models and the structure of the network. In view of this fact, our next goal is to conduct a preliminary analysis of how information should be allocated to the agents in order to maximize appropriate performance metrics that are a function of the asymptotic learning rates. Our investigation is inspired by similar questions in [jad2]; however, as we discuss next, our formulation differs considerably from [jad2]. Specifically, unlike [jad2], our proposed learning rule leads to asymptotic learning rates that are agent-dependent when $a > 1$ (as seen in Section VI). Consequently, the performance metrics that we seek to maximize differ from those in [jad2]. As we shall soon see, while the eigenvector centrality plays a key role in shaping the speed of learning in [jad2], alternate network centrality measures become important when it comes to the belief dynamics generated by our rule.
To make the above ideas precise, suppose we are given a strongly-connected communication graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, and a set of signal structures $\mathcal{L} = \{\mathcal{L}_1, \ldots, \mathcal{L}_n\}$, where each $\mathcal{L}_r$ represents a family of parameterized marginals as defined in Section II. By an allocation of signal structures to agents, we imply a bijection between the elements of $\mathcal{L}$ and the elements of the vertex set of $\mathcal{G}$, namely $\mathcal{V}$. Let $\Sigma$ represent the set of all possible bijections between the elements of $\mathcal{L}$ and $\mathcal{V}$. Our objective is to optimally pick $\sigma \in \Sigma$ so as to maximize the performance metrics that we define next. To this end, given a distinct pair of hypotheses $(\theta_p, \theta_q)$, recall from (4) that based on our proposed learning rule,
$$\rho^{\sigma}_i(\theta_p, \theta_q) \triangleq \max_{v \in \mathcal{S}^{\sigma}(\theta_p, \theta_q)} \frac{K^{\sigma}_v(\theta_p, \theta_q)}{a^{\,d(v,i)+1}} \qquad (29)$$

lower bounds the limiting rate at which agent $i$ rules out $\theta_q$ when $\theta_p$ is realized as the true state; the superscript $\sigma$ reflects the dependence of the corresponding objects on the allocation policy $\sigma$. We now introduce two measures of the quality of learning that are specific to our setting:
$$J^{\sigma}_{\mathrm{avg}} \triangleq \min_{(\theta_p, \theta_q)} \frac{1}{n} \sum_{i \in \mathcal{V}} \rho^{\sigma}_i(\theta_p, \theta_q), \qquad J^{\sigma}_{\min} \triangleq \min_{(\theta_p, \theta_q)} \min_{i \in \mathcal{V}} \rho^{\sigma}_i(\theta_p, \theta_q). \qquad (30)$$

While $J^{\sigma}_{\mathrm{avg}}$ captures the average rate of learning across the network, $J^{\sigma}_{\min}$ focuses on the agent that converges the slowest; given that any state in $\Theta$ can be realized as the true state, these metrics account for the pair of states that are the hardest to tell apart. We seek to maximize $J^{\sigma}_{\mathrm{avg}}$ and $J^{\sigma}_{\min}$ over the set of allocations $\Sigma$. Our first result on this topic makes a connection to two popular network centrality measures, namely, the eccentricity centrality and the decay centrality, defined as follows. For a strongly-connected graph $\mathcal{G}$, the eccentricity centrality [hage], $c_E(v)$, and the decay centrality [tsakas], $c_D(v)$, of an agent $v$ are given by
$$c_E(v) \triangleq \frac{1}{\max_{u \in \mathcal{V} \setminus \{v\}} d(v, u)}, \qquad c_D(v) \triangleq \sum_{u \in \mathcal{V} \setminus \{v\}} \delta^{\,d(v, u)}, \qquad (31)$$

where $\delta \in (0, 1)$ is the decay parameter.
The eccentricity centrality is a distance-based centrality measure that aims to find the ‘center’ of a graph, such that a process originating at the center minimizes the response time to any other agent. The decay centrality is also a closeness-based centrality measure, where an agent is rewarded for being close to other agents, with agents at higher distances contributing less to the centrality as compared to those that are closer. We have the following result.
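Both measures are easy to compute from shortest-path distances. The sketch below (illustrative code with our own names) evaluates them on a directed ring; the decay parameter value is an assumption of the example.

```python
from collections import deque

def distances(adj, v):
    """BFS shortest directed path lengths from v (adjacency-list graph)."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def eccentricity_centrality(adj, v):
    """Reciprocal of v's eccentricity, i.e., of its largest distance to any other agent."""
    d = distances(adj, v)
    return 1.0 / max(d[u] for u in adj if u != v)

def decay_centrality(adj, v, delta):
    """Sum of delta**d(v, u) over all agents u != v, for a decay parameter in (0, 1)."""
    d = distances(adj, v)
    return sum(delta ** d[u] for u in adj if u != v)

ring4 = {0: [1], 1: [2], 2: [3], 3: [0]}   # directed 4-cycle
print(eccentricity_centrality(ring4, 0))   # 1/3: the farthest agent is 3 hops away
print(decay_centrality(ring4, 0, 0.5))     # 0.5 + 0.25 + 0.125 = 0.875
```

On a symmetric ring every agent is equally central under both measures; the two measures start to disagree on graphs with asymmetric branches, which is where the choice of metric in (30) matters.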
Proposition 1.
Let $\mathcal{G}$ be strongly-connected. Suppose $a > 1$, and let there exist a signal structure $\mathcal{L}_1 \in \mathcal{L}$ such that the following is true for all pairs of distinct hypotheses $(\theta_p, \theta_q)$:

$$K^{\mathcal{L}_1}(\theta_p, \theta_q) > 0, \quad \text{and} \quad K^{\mathcal{L}_r}(\theta_p, \theta_q) = 0, \ \forall r \neq 1. \qquad (32)$$

Then, (i) any allocation $\sigma \in \Sigma$ that assigns $\mathcal{L}_1$ to an agent of maximum decay centrality (with decay parameter $\delta = 1/a$) maximizes $J^{\sigma}_{\mathrm{avg}}$, and (ii) any allocation $\sigma \in \Sigma$ that assigns $\mathcal{L}_1$ to an agent of maximum eccentricity centrality maximizes $J^{\sigma}_{\min}$.
Proof.
For part (i), consider two allocations $\sigma, \sigma' \in \Sigma$ such that $\sigma$ assigns $\mathcal{L}_1$ to an agent $v^\star$ of maximum decay centrality, and $\sigma'$ assigns $\mathcal{L}_1$ to some other agent $v$. Based on condition (32), and (29), it is easy to see that for any pair $(\theta_p, \theta_q)$, and for each $i \in \mathcal{V}$:
(33) 
Based on (31), we then obtain:
(34) 
where the second equality follows from the fact that the signal structures of agent $v^\star$ under allocation $\sigma$, and of agent $v$ under allocation $\sigma'$, are each equal to $\mathcal{L}_1$, and the last inequality follows by noting that $c_D(v^\star) \geq c_D(v)$, based on the choice of agent $v^\star$. The proof of part (i) then follows by noting that the inequality in (34) holds for every pair $(\theta_p, \theta_q)$. For part (ii), we proceed as in part (i) and compare two allocations $\sigma, \sigma' \in \Sigma$ such that $\sigma$ assigns $\mathcal{L}_1$ to an agent of maximum eccentricity centrality, and $\sigma'$ assigns it to some other agent. The equalities in (33) hold once again, and combined with (31) lead to: