A Communication-Efficient Algorithm for Exponentially Fast Non-Bayesian Learning in Networks
We introduce a simple time-triggered protocol to achieve communication-efficient non-Bayesian learning over a network. Specifically, we consider a scenario where a group of agents interact over a graph with the aim of discerning the true state of the world that generates their joint observation profiles. To address this problem, we propose a novel distributed learning rule wherein agents aggregate neighboring beliefs based on a min-protocol, and the inter-communication intervals grow geometrically at a rate . Despite such sparse communication, we show that each agent is still able to rule out every false hypothesis exponentially fast with probability , as long as is finite. For the special case when communication occurs at every time-step, i.e., when , we prove that the asymptotic learning rates resulting from our algorithm are network-structure independent, and a strict improvement upon those existing in the literature. In contrast, when , our analysis reveals that the asymptotic learning rates vary across agents, and exhibit a non-trivial dependence on the network topology coupled with the relative entropies of the agents’ likelihood models. This motivates us to consider the problem of allocating signal structures to agents to maximize appropriate performance metrics. In certain special cases, we show that the eccentricity centrality and the decay centrality of the underlying graph help identify optimal allocations; for more general scenarios, we bound the deviation from the optimal allocation as a function of the parameter , and the diameter of the communication graph.
A typical problem in networked systems involves a global task that needs to be accomplished by a group of entities or agents via local computations and information exchanges over the network. These agents, however, are typically endowed with partial information about the state of the system; as such, inter-agent communication becomes indispensable for achieving the common goal. Given this premise, it is natural to ask: how frequently must the agents communicate to solve the desired problem? Owing to its practical relevance, the question posed above has received significant recent interest by the control system, information theory and machine learning communities in the context of a variety of problems, namely average consensus [olshevsky], optimization [opt1, opt2, opt3], and static parameter estimation [sahu]. Our goal in this paper is to extend such investigations to the problem of non-Bayesian learning in a network, also known as the distributed hypothesis testing problem [jad1, jad2, shahin, nedic, lalitha, mitraACC19]. Specifically, the global task in this setting involves learning the true state of the world (among a finite set of hypotheses) that explains the private observations of each agent in the network. Two notable features that are specific to this problem are as follows. Unlike consensus or distributed optimization, agents are privy to exogenous signals, which, if informative, can enable them to eliminate a subset of the false hypotheses exponentially fast. A related problem where agents receive exogenous signals (measurements) is that of distributed state estimation [martins, mitraTAC] where the global task entails tracking potentially unstable dynamics. In contrast, the true state of the world remains fixed over time in our setting, considerably simplifying the objective. These attributes play in favor of the problem at hand, motivating us to ask the following questions. (i) Can we design an algorithm that enables each agent to learn the truth with sparse communication schedules (and in fact, even sparser than typically employed for other classes of distributed problems)? (ii) If so, how fast do the agents learn the truth? (iii) Can we quantify the trade-off(s) between sparsity in communication and the rate of learning? To the best of our knowledge, these questions remain largely unexplored. In this paper, we take a preliminary step towards responding to them via the following contributions.
We develop and analyze a simple time-triggered learning rule that builds on our recent work on distributed hypothesis testing [mitraACC19]. Specifically, the data-aggregation step of our algorithm involves a min-protocol as opposed to the consensus-based averaging schemes intrinsic to existing linear [jad1, jad2] and log-linear [shahin, nedic, lalitha] learning rules. The basic strategy we employ to achieve communication-efficiency is in line with those proposed in [olshevsky, opt1, sahu], where inter-agent communications become progressively sparser as time evolves. In particular, the authors in [olshevsky] and [opt1] explore deterministic rules where the inter-communication intervals grow logarithmically and polynomially in time, respectively. In contrast, the authors in [sahu] propose a rule where at each time-step, an agent communicates with its neighbors in the graph with a probability that decays to zero at a sub-linear rate. In essence, these approaches establish that as long as the inter-communication intervals do not grow too fast, the global task can still be achieved. We depart from these approaches by allowing the inter-communication intervals to grow much faster: at a geometric rate , where the parameter can be adjusted to control the frequency of communication. While more refined approaches to achieve communication-efficiency are conceivable, we show that our simple time-triggered protocol yields strong guarantees. Specifically, we prove that even with an arbitrarily large (which leads to a highly sparse communication schedule), each agent is still able to learn the truth with probability , provided is finite. Furthermore, we establish that such learning occurs exponentially fast, and characterize the limiting error exponents as a function of certain parameters of our model, and the constant . In particular, our characterization quantifies the trade-offs between communication-efficiency and the speed of learning for the specific problem under consideration.
Our analysis subsumes the special case when communication occurs at every time-step, i.e., when , which corresponds to the scenario studied in our previous work [mitraACC19]. While the general approach in [mitraACC19] was shown to be robust to worst-case adversarial attack models, a convergence-rate analysis of the same was missing. A significant contribution of this paper is to fill this gap by establishing that when , the asymptotic learning rates resulting from our proposed algorithm are network-structure independent, and a strict improvement over the rates provided by existing algorithms in the literature. In contrast, when , we show that the asymptotic learning rates differ from agent to agent, and depend not only on the relative entropies of the agents’ signal models, but also on properties of the underlying network. Given this result, we introduce two new measures of the quality of learning, and study the problem of allocating signal structures to agents to maximize such measures. In certain special cases, we show that the eccentricity centrality and the decay centrality of the communication network play key roles in identifying the optimal allocations. For more general cases, we bound the deviation from the optimal allocation as a function of the parameter , and the diameter of the graph.
Ii Model and Problem Formulation
Network Model: We consider a setting comprising of a group of agents . At certain specific time-steps (to be decided by a time-triggered communication schedule), these agents interact with each other over a directed graph . An edge indicates that agent can directly transmit information to agent ; in such a case, agent will be called a neighbor of agent . The set of all neighbors of agent will be denoted . For a strongly-connected graph , we will use to denote the length of the shortest directed path from agent to agent , and to denote the diameter of the graph.111A graph is said to be strongly-connected if it has a directed path between every pair of agents; the diameter of such a graph is the length of the longest shortest path between the agents.
Observation Model: Let denote possible states of the world, with each state representing a hypothesis. A specific state , referred to as the true state of the world, gets realized. Conditional on its realization, at each time-step , every agent privately observes a signal , where denotes the signal space of agent .222We use and to refer to the set of non-negative integers and positive integers, respectively. The joint observation profile so generated across the network is denoted , where , and . Specifically, the signal is generated based on a conditional likelihood function , the -th marginal of which is denoted , and is available to agent . The signal structure of each agent is thus characterized by a family of parameterized marginals . We make certain standard assumptions [jad1, jad2, shahin, nedic, lalitha]: (i) The signal space of each agent , namely , is finite. (ii) Each agent has knowledge of its local likelihood functions , and it holds that , and . (iii) The observation sequence of each agent is described by an i.i.d. random process over time; however, at any given time-step, the observations of different agents may potentially be correlated. (iv) There exists a fixed true state of the world (that is unknown to the agents) that generates the observations of all the agents. The probability space for our model is denoted , where , is the -algebra generated by the observation profiles, and is the probability measure induced by sample paths in . Specifically, . We will use the abbreviation a.s. to indicate almost sure occurrence of an event w.r.t. .
Given the above setting, the goal of each agent in the network is to eventually learn the true state of the world . However, the signal structure of any given agent is in general only partially informative, thereby precluding this task from being achieved by any agent in isolation. Specifically, let represent the set of hypotheses that are observationally equivalent to the true state from the perspective of agent . An agent is deemed partially informative about the truth if . Since potentially every agent can be partially informative in the sense described above, inter-agent communications become necessary for each agent to learn the truth.
In this context, our objectives in this paper are to develop an understanding of (i) the amount of leeway that the above problem affords in terms of sparsifying inter-agent communications without compromising the objective of learning the truth, and (ii) the trade-offs between sparse communication and the rate of learning. To this end, we recall the following definition from [mitraACC19] that will prove useful in our subsequent developments.
(Source agents) An agent is said to be a source agent for a pair of distinct hypotheses if it can distinguish between them, i.e., if , where represents the KL-divergence [cover] between the distributions and . The set of source agents for pair is denoted .
Throughout the rest of the paper, we will use as a shorthand for .
Iii A Communication-Efficient Learning Rule
In this section, we formally introduce a simple time-triggered belief update rule parameterized by a constant that determines the frequency of communication (to be made more precise below). In order to collaboratively learn the true state of the world, every agent maintains a local belief vector , and an actual belief vector , each of which are probability distributions over the hypothesis set . These vectors are initialized with (but otherwise arbitrarily), and subsequently updated as follows.
Update of the local beliefs: At each time-step , the local belief vectors are updated based on a standard Bayesian rule:
Update of the actual beliefs: Let denote a sequence of time-steps satisfying , with . If , then is updated as
If , is simply held constant as follows:
In words, while the local beliefs are updated at every time-step, the actual beliefs are updated only at time-steps that belong to the set , i.e., an agent is allowed to transmit to its out-neighbors, and receive from each in-neighbor in if and only if . When , the actual beliefs get updated as per (2) at every time-step, and we recover the rule proposed in [mitraACC19]. When , note that the inter-communication intervals grow exponentially at a rate dictated by the parameter . Our goal in this paper is to precisely characterize the impact of such sparse communication on the asymptotic rate of learning of each agent. Prior to doing so, a few comments are in order.
First, notice that the data-aggregation rule in (2) is based on a min-protocol, as opposed to any form of “belief-averaging” commonly employed in the existing distributed learning literature [jad1, jad2, shahin, nedic, lalitha]. Essentially, while the local belief updates (1) capture what an agent can learn by itself, the actual belief updates (2) incorporate information from the rest of the network. As demonstrated by Corollary 1 in the next section, when , such a min-protocol yields better asymptotic learning rates than all existing schemes. This motivates us to use a belief update rule of the form (2) for studying the case when . Second, we note that the proposed time-triggered protocol is simple, easy to implement and computationally cheap. At the same time, the exponentially growing intervals afford a much sparser communication schedule relative to related literature. Third, while one can potentially consider extensions of this algorithm that account for asynchronicity, communication failures, delays etc., we focus on the scheme here in order to (i) concretely isolate the trade-off between sparse communication and the quality of learning as measured by the asymptotic learning rates, and (ii) provide insights into how the network structure impacts such rates. A final comment needs to made regarding the choice of achieving communication-efficiency by cutting down on communication rounds as opposed to truncating the number of bits exchanged per communication round, an approach pursued in quantization-based schemes [suresh]. As argued in [opt2], communication latency acts as the bottleneck of overall performance and dominates message-size dependent transmission latency when it comes to transmitting small messages, such as the -dimensional actual belief vectors in our setting. This justifies our sparse communication scheme. With these points in mind, we proceed to the analysis of the algorithm developed in this section.
Iv Main Result and Discussion
The main result of the paper is as follows; the proof of this result is presented in Section V.
Suppose the communication parameter satisfies , and the following conditions are met.
For every pair of hypotheses , the corresponding source set is non-empty.
The communication graph is strongly-connected.
Every agent has a non-zero prior belief on each hypothesis, i.e., for all , and for all .
(Consistency): For each agent , a.s.
(Asymptotic Rate of Rejection of False Hypotheses): For each agent , and for each false hypothesis the following holds:
Suppose communication occurs at every time-step, i.e., suppose . Let conditions (i)-(iii) in the statement of Theorem 1 hold. Then, our proposed learning rule guarantees consistency in the same sense as in Theorem 1. Furthermore, for each agent , and for each false hypothesis , the following holds:
We remark on the implications of the above results.
Implications of Theorem 1: We first note that despite its simplicity, the time-triggered algorithm proposed in Section III provides strong guarantees: Eqn. (4) indicates that although the inter-communication intervals grow exponentially at an arbitrarily large (but finite) rate , each agent is still able to eliminate every false hypothesis at an exponential rate with probability . More interestingly, (4) reveals that the asymptotic learning rates are agent-specific, i.e., different agents may discover the truth at different rates.333We use the lower bounds derived in (4), (5) as a proxy when referring to the corresponding asymptotic learning rates. In particular, when considering the asymptotic rate of rejection of a particular false hypothesis at a given agent , notice from the RHS of (4) that one needs to account for the attenuated relative entropies of the corresponding source agents, where the attenuation factor scales exponentially with the distances of agent from such source agents. This contrasts with existing literature [jad1, jad2, shahin, lalitha, nedic], and the case when in Corollary 1, where all agents learn the truth at identical rates.
Implications of Corollary 1: In sharp contrast to the case when , Corollary 1 indicates that when communication occurs at every time-step (i.e., ), the asymptotic learning rates are network-structure independent, and identical for each agent. Since this case represents the standard distributed hypothesis testing setup studied in literature, it becomes important to know how such rates compare with those resulting from existing “belief-averaging” schemes [jad1, jad2, shahin, lalitha, nedic]. To this end, we note that under the same set of assumptions as in Theorem 1, both linear [jad1, jad2] and log-linear [shahin, lalitha, nedic] opinion pooling lead to an asymptotic rate of rejection of the form for each false hypothesis , and the rate is identical for each agent. Here, represents the eigenvector centrality of agent . It is well known that for a strongly-connected graph, . Thus, based on the above discussion, and referring to (5), we conclude that a significant contribution of the algorithm proposed in this paper is that it yields strictly better asymptotic learning rates than those existing in the literature, for the standard setting when .444Recently, in [mitraTAC19], we showed that this result continues to hold even if the underlying graph changes with time, but satisfies a mild joint-strong connectivity condition.
Trade-Off between Sparse Communication and Quality of Learning: From (4), it is apparent that sparser communication schedules (corresponding to larger ’s) come at the cost of lower asymptotic learning rates. Furthermore, since such rates depend upon the network-structure when , a poor allocation of signal structures to agents can have adverse effects on the learning rates of certain agents. However, the above problem is readily bypassed when , since the learning rates for that case solely depend on the relative entropies of the agents, as shown by (5).
V Proof of the Main Result
In order to prove Theorem 1, we require a few intermediate results. The first one is a standard consequence of Bayesian updating, and characterizes the behavior of the local belief trajectories generated via (1); for a proof, see [mitraACC19].
Consider a false hypothesis , and an agent . Suppose . Then, the update rule (1) ensures that (i) a.s., (ii) exists a.s. and satisfies , and (iii) the following holds:
Let denote the set of sample paths for which the assertions in Lemma 1 hold for each false hypothesis . Based on Lemma 1, we note that . Consequently, to prove the result, it suffices to establish the existence of , and such that (7) holds for each sample path . To this end, pick an arbitrary sample path We first argue that the local beliefs of every agent on the true state are bounded away from on . To see this, pick any agent . Suppose there exists some for which . Then, based on our choice of , it follows directly from Lemma 1 that , where the last inequality follows from condition (iii) in Theorem 1. In particular, given the structure of the update rule (1), it follows that for all time (since if at any instant, then the corresponding belief would remain at for all subsequent time-steps, thereby violating the fact that ). If there exists no for which , then every hypothesis in is observationally equivalent to from the point of view of agent . In this case, it is easy to see that based on (1), . In particular, this implies . This establishes our claim that on , the local beliefs of all the agents remain bounded away from .
To proceed, define , where the inequality follows from condition (iii) in Theorem 1. Pick a small number such that , and notice that our discussion concerning the evolution of the local beliefs readily implies the existence of a time-step , such that for all , . Now define , and observe that . This observation follows from the fact that given the structure of the update rules (2) and (3), and condition (iii) in Theorem 1, can equal if and only if some agent in the network sets its local belief on to at some time-step prior to . However, this possibility is ruled out in view of the previously established fact that on , . Let . It is apparent from the preceding discussion that . It remains to establish a similar result for the actual beliefs . To this end, let be the first time-step following that belongs to the set . Based on (3), notice that for all , and for each . Based on (2), at time-step , for an agent satisfies:
where the last equality follows from the fact that the local belief vectors generated via (1) are valid probability distributions over the hypothesis set at each time-step, and hence . The above argument applies identically to each agent in . Furthermore, it is easily seen that based on (3), and a similar reasoning as above, identical conclusions can be drawn for each time-step when the agents update their actual beliefs based on (2). This readily establishes (7), and completes the proof. ∎
Throughout this proof, we use the same notation as in the proof of Lemma 2. With as in Lemma 2, pick an arbitrary sample path , an agent , and an agent . Since condition (ii) in Theorem 1 is met, there exists a directed path of shortest length from agent to agent in . To prove the result, we shall induct on the length of such a path. First, we consider the base case when , i.e., when . In other words, we will analyze the asymptotic rate of rejection of at the source agent . Fix , and notice that since , Lemma 1 implies that there exists , such that:
Since , Lemma 2 guarantees the existence of a time-step , and a constant , such that on , . Let . For the remainder of the proof, to simplify the notation, we suppress the dependence of various quantities on the parameters , and , since such dependence can be easily inferred from context. Let be the first time-step following that belongs to , i.e., a time-step when agent updates its actual beliefs based on (2). Then, based on the preceding discussion and (2), we have:
where . Regarding the inequalities in (11), (a) follows directly from (2), whereas (b) follows from (10) and the fact that lower bounds the beliefs (both local and actual) of all agents on the true state . Note that consecutive trigger-points satisfy . Based on (3), we then have:
Based on our rule, the next update of takes place at time-step . Employing the same reasoning as we did to arrive at (11), we obtain:
Coupled with the above inequality, (3) once again implies:
Generalizing the above reasoning, we obtain:
, , where
This immediately leads to the conclusion that for any :
Taking the natural log on both sides of (17), dividing throughout by , and simplifying, we obtain that :
Let . Then, taking the limit inferior on both sides of the above inequality yields:
where the second inequality follows from the fact that , and the final equality results from further simplifications based on (18). Finally, note that can be made arbitrarily small in the above inequality, and that the above conclusions hold for a generic sample path , where . This establishes (9) for the case when , and completes the proof of the base case of our induction. To proceed, suppose (9) holds for each node satisfying , where is a non-negative integer satisfying (recall that represents the diameter of the graph ). Let be such that Thus, there must exist some node such that The induction hypothesis applies to this node , and hence, we have:
Let be the set of sample paths for which the above inequality holds. With defined as before, notice that , since and each have -measure . Pick an arbitrary sample path , and notice that based on arguments identical to the base case, on the sample path there exists a time-step , such that the beliefs of all agents on are bounded below by following , and
where is an arbitrary small number and
Repeating the above analysis for each time-step of the form , using (3), and following similar arguments as in the base case yields that ,
The induction step, and in turn the proof can be completed by substituting the expression for in the above inequality and recalling that . ∎
We are now in position to prove Theorem 1.
Vi Simulation Example
Consider a binary hypothesis testing scenario where , and is the true state of the world. The network of agents is depicted in Figure 1. The signal space for every agent is identical, and given by . The agent likelihood models satisfy: , and Thus, only agents and can distinguish between and , with their relative entropies satisfying (all other agents have . With , we have . Figure 2 plots the instantaneous rates of rejection of the false hypothesis for agents and , resulting from our proposed algorithm. Based on Figures 1 and 2, a few key observations are: (i) each informative agent dominates the speed of learning of agents that are close to it in the network, (ii) the rate of rejection of the false hypothesis is indeed agent-specific, and (iii) the simulation results agree very closely with the theoretical lower bounds on the limiting rates of rejection in Theorem 1.
Vii The Impact of Information Allocation on Asymptotic Learning Rates
Theorem 1 indicates that the asymptotic learning rates of the agents are shaped by a non-trivial interplay between the relative entropies of their signal models and the structure of the network. In view of this fact, our next goal is to conduct a preliminary analysis of how information should be allocated to the agents in order to maximize appropriate performance metrics that are a function of the asymptotic learning rates. Our investigation is inspired by similar questions in [jad2]; however, as we discuss next, our formulation differs considerably from [jad2]. Specifically, unlike [jad2], our proposed learning rule leads to asymptotic learning rates that are agent-dependent when (as seen in Section VI). Consequently, the performance metrics that we seek to maximize differ from those in [jad2]. As we shall soon see, while the eigenvector centrality plays a key role in shaping the speed of learning in [jad2], alternate network centrality measures become important when it comes to the belief dynamics generated by our rule.
To make the above ideas precise, suppose we are given a strongly-connected communication graph , and a set of signal structures , where each represents a family of parameterized marginals as defined in Section II. By an allocation of signal structures to agents, we imply a bijection between the elements of and the elements of the vertex set of , namely . Let represent the set of all possible bijections between the elements of and . Our objective is to optimally pick so as to maximize the performance metrics that we define next. To this end, given a distinct pair of hypotheses , recall from (4) that based on our proposed learning rule,
lower bounds the limiting rate at which agent rules out when is realized as the true state; the superscript reflects the dependence of the corresponding objects on the allocation policy We now introduce two measures of the quality of learning that are specific to our setting:
While captures the average rate of learning across the network, focuses on the agent that converges the slowest; given that any state in can be realized, these metrics account for the pair of states that are the hardest to tell apart. We seek to maximize and over the set of allocations . Our first result on this topic makes a connection to two popular network centrality measures, namely, the eccentricity centrality and the decay centrality, defined as follows. For a strongly-connected graph , the eccentricity centrality [hage], and the decay centrality [tsakas], of an agent are given by
where is the decay parameter.
The eccentricity centrality is a distance-based centrality measure that aims to find the ‘center’ of a graph such that a process originating at the center minimizes the response time to any other agent. The decay centrality is also a closeness-based centrality measure where an agent is rewarded for being close to other agents, with agents at higher distances contributing less to the centrality as compared to those that are closer. We have the following result.
Let be strongly-connected. Suppose , and let there exist a signal structure such that the following is true for all ,
Then, (i) any allocation such that maximizes , and (ii) any allocation such that maximizes .
Based on (31), we then obtain:
where the second equality follows from the fact that the signal structure of agent under allocation , and agent under allocation , are each equal to , and the last inequality follows by noting that based on the choice of agent . The proof of part (i) then follows by noting that the inequality in (34) holds for every pair For part (ii), we proceed as in part (i) and compare two allocations such that , and . The equalities in (33) hold once again, and combined with (31) lead to: