Diffusion-Based Adaptive Distributed Detection: Steady-State Performance in the Slow Adaptation Regime

Abstract

This work examines the close interplay between cooperation and adaptation for distributed detection schemes over fully decentralized networks. The combined attributes of cooperation and adaptation are necessary to enable networks of detectors to continually learn from streaming data and to continually track drifts in the state of nature when deciding in favor of one hypothesis or the other. The results of the paper establish a fundamental scaling law for the steady-state probabilities of miss-detection and false-alarm in the slow adaptation regime, when the agents interact with each other according to distributed strategies that employ small constant step-sizes. The latter are critical to enable continuous adaptation and learning. The work establishes three key results. First, it is shown that the output of the collaborative process at each agent has a steady-state distribution. Second, it is shown that this distribution is asymptotically Gaussian in the slow adaptation regime of small step-sizes. And third, by carrying out a detailed large deviations analysis, closed-form expressions are derived for the decay rates of the false-alarm and miss-detection probabilities. Interesting insights are gained from these expressions. In particular, it is verified that as the step-size $\mu$ decreases, the error probabilities are driven to zero exponentially fast as functions of $1/\mu$, and that the exponents governing the decay increase linearly in the number of agents. It is also verified that the scaling laws governing errors of detection and errors of estimation over networks behave very differently, with the former exhibiting an exponential decay proportional to $1/\mu$, while the latter scales linearly with decay proportional to $\mu$. Moreover, and interestingly, it is shown that the cooperative strategy allows each agent to reach the same detection performance, in terms of detection error exponents, as that of a centralized stochastic-gradient solution. The results of the paper are illustrated by applying them to canonical distributed detection problems.

Index Terms: Distributed detection, adaptive network, diffusion strategy, consensus strategy, false-alarm probability, miss-detection probability, large deviations analysis.

1 Overview

Recent advances in the field of distributed inference have produced several useful strategies aimed at exploiting local cooperation among network nodes to enhance the performance of each individual agent. However, the increasing availability of streaming data continuously flowing across the network has added the new and challenging requirement of online adaptation to track drifts in the data. In the adaptive mode of operation, the network agents must be able to enhance their learning abilities continually in order to produce reliable inference in the presence of drifting statistical conditions, drifting environmental conditions, and even changes in the network topology, among other possibilities. Therefore, concurrent adaptation (i.e., tracking) and learning (i.e., inference) are key components for the successful operation of distributed networks tasked to produce reliable inference under dynamically varying conditions and in response to streaming data.

Several useful distributed implementations based on consensus strategies [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] and diffusion strategies [13, 14, 15, 16, 17, 18] have been developed for this purpose in the literature. The diffusion strategies have been shown to have superior stability ranges and mean-square performance when constant step-sizes are used to enable continuous adaptation and learning [19]. For example, while consensus strategies can lead to unstable growth in the state of adaptive networks even when all agents are individually stable, this behavior does not occur for diffusion strategies. In addition, diffusion schemes are robust, scalable, and fully decentralized. Since in this work we focus on studying adaptive distributed inference strategies, we shall therefore focus on diffusion schemes due to their enhanced mean-square stability properties over adaptive networks.

Now, the interplay between the two fundamental aspects of cooperation and adaptation has been investigated rather extensively in the context of estimation problems. Less explored in the literature is the same interplay in the context of detection problems. This is the main theme of the present work. Specifically, we shall address the problem of designing and characterizing the performance of diffusion strategies that reconcile both needs of adaptation and detection in decentralized systems. The following is a brief description of the scenario of interest.

A network of connected agents is assumed to monitor a certain phenomenon of interest. As time elapses, the agents collect an increasing amount of streaming data, whose statistical properties depend upon an unknown state of nature. The state is formally represented by a pair of hypotheses, say, $H_0$ and $H_1$. At each time instant, each agent is expected to produce a decision about the state of nature, based upon its own observations and the exchange of information with neighboring agents. The emphasis here is on adaptation: we allow the true hypothesis to drift over time, and the network must be able to track the drifting state. This framework is illustrated in Fig. 1, where we show the time-evolution of the actual realization of the decision statistics computed by three generic network agents. Two situations are considered. In the first case, the agents run a constant step-size diffusion strategy [20, 15], and in the second case, the agents run a consensus strategy with a diminishing step-size of the form $1/n$ [1, 2, 3, 4, 5, 6]. Note from the curves in the figure that the statistics computed by different sensors are hardly distinguishable, emphasizing a certain equivalence in performance among distinct agents, an important feature that will be extensively commented on in the forthcoming analysis.

Figure 1: The top panel illustrates the time-evolution of the decision statistics at three generic local agents for two situations: (a) constant step-size adaptation using a diffusion strategy and (b) diminishing step-size updates using a running consensus strategy. The evolution of the true hypothesis is depicted in the bottom panel.

Assume that high (positive) values of the statistic correspond to deciding for $H_1$, while low (negative) values correspond to deciding for $H_0$. The bottom panel in the figure shows how the true (unknown) hypothesis changes at certain (unknown) epochs. It is seen in the figure that the adaptive diffusion strategy is more apt at tracking the drifting state of nature. It is also seen that the diminishing step-size consensus implementation is unable to track the changing conditions. Moreover, the inability to track the drift degrades further as time progresses, since the step-size sequence decays to zero as $n \to \infty$. For this reason, in this work we shall set the step-sizes to constant values to enable continuous adaptation and learning by the distributed network of detectors. In order to evaluate how well these adaptive networks perform, we need to be able to assess the goodness of the inference performance (reliability of the decisions), so as to exploit the trade-off between adaptation and learning capabilities. This will be the main focus of the paper.

1.1 Related Work

The literature on distributed detection is definitely rich, see, e.g., [21, 22, 23, 24, 25, 26, 27, 28] as useful entry points on the topic. A distinguishing feature of our approach is its emphasis on adaptive distributed detection techniques that respond to streaming data in real-time. We address this challenging problem with reference to the fully decentralized setting, where no fusion center is admitted, and the agents cooperate through local interaction and consultation steps.

For several useful formulations of distributed point estimation and detection, the use of stochastic approximation consensus-based solutions with diminishing step-sizes leads to asymptotically optimal performance, either in the sense of asymptotic variance in point estimation [12], in the sense of error exponents [4, 5, 6], or in the sense of asymptotic relative efficiency in the locally optimum detection framework [2]. Optimality in these works is formulated in reference to the centralized solution, and the qualification "asymptotic" is used to refer either to a large number of observations or a large time window. The error performance (e.g., mean-square error for estimation or error probabilities for detection) is shown in these works to decay with optimal rates as time elapses, provided that some conditions on the network structure are met. For these results to hold, it is critical for the statistical properties of the data to remain invariant and for the algorithms to rely on a recursive test statistic with a diminishing step-size.

In some other distributed inference applications, however, the statistical properties of the data can vary over time. For instance, in a detection problem, the actual hypothesis in force, and/or some parameters of the pertinent distributions, might change at certain moments. Therefore, the adaptation aspect, i.e., the capability of persistently tracking dynamic scenarios, becomes important. In such scenarios, the diffusion algorithms (with non-diminishing, constant step-size) provide effective mechanisms for continuous adaptation and learning. Similar to the consensus-based algorithms with diminishing step-sizes, they are easy to implement, since they involve linear operations, and are naturally suited to a fully distributed implementation. However, differently from the consensus algorithms with diminishing step-size, the strategies with constant step-size are inherently able to work under dynamically changing conditions and offer enhanced tracking capability.

1.2 Inherent Tracking Mechanism

It is well-known in the adaptation and learning literature that using constant step-sizes in the update relations automatically infuses the algorithms with a tracking mechanism that enables them to follow variations in the underlying models. This is because constant step-sizes keep adaptation alive forever. This is in contrast to decaying step-sizes, which tend to zero and ultimately stop adapting. With a constant step-size, learning is always active. When the hypothesis changes, an algorithm with a constant step-size will continue learning from that point onwards and, given sufficient time to learn, the steady-state analysis in this article will show that the probabilities of error indeed decay exponentially as functions of the inverse of the step-size.

The key challenge in these scenarios is that a constant step-size keeps the update active, which then causes gradient noise to seep continuously into the operation of the algorithm. This effect does not arise for decaying step-sizes, because the diminishing step-size annihilates the gradient noise term in the limit. However, a decaying step-size cannot track changing hypotheses, precisely because it vanishes. The difficulty in the constant step-size case is therefore to show that, despite the presence of gradient noise, the dynamics of the learning algorithm can keep this effect in check and remains capable of learning. The more it learns, the more it reduces the size of the gradient noise, and this feedback mechanism leads to effective learning. This is one of the key conclusions of this work, namely, showing that the probabilities of error indeed decay exponentially with the inverse of the step-size. This result is non-trivial, and the derivations will take some effort before arriving at the insightful scaling laws presented in this work.

1.3 Analysis of Detection Performance

The aforementioned properties of the diffusion strategies used in this work explain their widespread utilization in the context of adaptive estimation [17], and motivate their use in the context of adaptive distributed detection [29, 30, 31]. With reference to this class of algorithms, while several results have been obtained for the mean-square-error (MSE) estimation performance of adaptive networks [20, 15], less is known about the performance of distributed detection networks. In particular, in [29], the miss-detection and false-alarm probabilities have been evaluated with reference to Gaussian observations. However, a detailed analytical characterization of the detection performance (i.e., false-alarm and detection probabilities), with reference to a general observational model, is still missing. This is mainly due to the fact that results on the asymptotic distribution of the error quantities under constant step-size adaptation over networks are largely unavailable in the literature.

While reference [32] argues that the error in single-agent least-mean-squares (LMS) adaptation converges in distribution, the resulting distribution is not characterized. These questions are considered in [33, 34] in the context of distributed estimation over adaptive networks. Nevertheless, these results on the asymptotic distribution of the errors are still insufficient to characterize the rate of decay of the probability of error over networks of distributed detectors. The main purpose of this work is to fill this gap. To do so, it is necessary to pursue a large deviations analysis in the constant step-size regime. Motivated by these remarks, we therefore provide a thorough statistical characterization of the diffusion network in a manner that enables detector design and analysis.

Notation. We use boldface letters to denote random variables, and normal font letters for their realizations. Capital letters refer to matrices, small letters to both vectors and scalars. Sometimes we violate this latter convention; for instance, we denote the total number of sensors by $S$. The symbols $P$ and $E$ are used to denote the probability and expectation operators, respectively. The notation $P_h$ and $E_h$, with $h = 0, 1$, means that the pertinent statistical distribution corresponds to hypothesis $H_0$ or $H_1$.

2 Preliminaries and Main Results

Consider a connected network of $S$ agents. The scalar observation collected by the $k$-th sensor at time $n$ will be denoted by $d_{k,n}$, with $k = 1, 2, \ldots, S$. Data are assumed to be spatially and temporally independent and identically distributed (i.i.d.), conditioned on the hypothesis that gives rise to them. The distributed network is interested in making an inference about the true state of nature (i.e., the underlying hypothesis), which is allowed to vary over time. Since in this work we focus on a steady-state analysis, it is unnecessary at this stage to introduce an explicit dependence of the datum on the particular hypothesis giving rise to it.

Remark. When dealing with i.i.d. observations across sensors, the important issue of local versus aggregate distinguishability is bypassed. In most practical scenarios, sensors observe different aspects of a field, so local distinguishability is hard to achieve, but the collective observation model may still be globally informative. The case in which local information is not sufficient for discrimination has been studied in several works before, including [35, 36, 37], and in other related references on diffusion strategies. In the context of multi-agent processing, the distinguishability condition essentially amounts to a positivity condition on the global Gramian (Hessian) matrix, while allowing the individual Gramians to be non-negative definite. Learning is still possible in these cases, as shown, for example, in [38, 39, 17].

As is well-known, for the i.i.d. data model, an optimal centralized (and non-adaptive) detection statistic is the sum of the log-likelihoods. When these are not available, alternative detection statistics obtained as the sum of some suitably chosen functions of the observations are often employed, as happens in some specific frameworks, e.g., in locally optimum detection [45] and in universal hypothesis testing [46]. Accordingly, each sensor in the network will try to compute, as its own detection statistic, a weighted combination of some function of the local observations. We assume the symbol $x_{k,n}$ represents the local statistic that is available at time $n$ at sensor $k$.

Since we are interested in an adaptive inferential scheme, and given the idea of relying on weighted averages, we resort to the class of diffusion strategies for adaptation over networks [29, 15]. These strategies admit various forms. We consider the adapt-then-combine (ATC) form due to some inherent advantages in terms of a slightly improved mean-square-error performance relative to other forms [15]. In the ATC diffusion implementation, each node $k$ updates its state from $y_{k,n-1}$ to $y_{k,n}$ through local cooperation with its neighbors as follows:

(1) $v_{k,n} = y_{k,n-1} + \mu\,(x_{k,n} - y_{k,n-1})$
(2) $y_{k,n} = \sum_{l=1}^{S} a_{k,l}\, v_{l,n}$

where $0 < \mu < 1$ is a small step-size parameter. In this construction, node $k$ first uses its local statistic, $x_{k,n}$, to update its state from $y_{k,n-1}$ to an intermediate value $v_{k,n}$. All other nodes in the network perform similar updates simultaneously using their local statistics. Subsequently, node $k$ aggregates the intermediate states of its neighbors using nonnegative convex combination weights $a_{k,l}$ that add up to one. Again, all other nodes in the network perform a similar calculation. If we collect the combination coefficients into a matrix $A = [a_{k,l}]$, then $A$ is a right-stochastic matrix in that the entries on each of its rows add up to one:

(3) $a_{k,l} \ge 0, \qquad A\,\mathbf{1} = \mathbf{1}$

with $\mathbf{1}$ being a column-vector with all entries equal to one.
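To make the recursion concrete, here is a minimal simulation sketch of (1)-(2) in Python. The ring network, the combination weights, the step-size, and the Gaussian statistic generator are all illustrative placeholders, not quantities prescribed by the analysis.

```python
import numpy as np

def atc_diffusion(A, draw_statistics, mu, n_iter, y0=None):
    """Run the ATC diffusion recursion (1)-(2) and return the final states.

    A               : (S, S) right-stochastic combination matrix (rows sum to one)
    draw_statistics : callable returning the S local statistics x_{k,n}
    mu              : constant step-size enabling continuous adaptation
    """
    S = A.shape[0]
    y = np.zeros(S) if y0 is None else np.asarray(y0, dtype=float)
    for _ in range(n_iter):
        x = draw_statistics()     # local statistics x_{k,n}, one per agent
        v = y + mu * (x - y)      # adaptation step (1)
        y = A @ v                 # combination step (2): y_k = sum_l a_{k,l} v_l
    return y

# Illustrative run: S = 10 agents on a ring with convex (doubly stochastic) weights.
S, mu = 10, 0.05
I = np.eye(S)
A = 0.5 * I + 0.25 * (np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1))
rng = np.random.default_rng(0)
y = atc_diffusion(A, lambda: rng.normal(1.0, 1.0, size=S), mu, n_iter=2000)
print(y)  # all agents hover around E[x] = 1, with fluctuations of order sqrt(mu)
```

The constant step-size keeps the update alive, which is exactly the adaptation mechanism discussed in Sec. 1.2.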

2.1 Performance and Convergence Analyses

At time $n$, the $k$-th sensor needs to produce a decision based upon its state value $y_{k,n}$. To this aim, a decision rule must be designed, by choosing appropriate decision regions $\Gamma_0$ and $\Gamma_1$, in favor of $H_0$ and $H_1$, respectively. The performance of the test will be measured according to the Type-I (false-alarm) and Type-II (miss-detection) error probabilities defined, respectively, as

(4) $\alpha_{k,n} \triangleq P_0[\,y_{k,n} \in \Gamma_1\,]$
(5) $\beta_{k,n} \triangleq P_1[\,y_{k,n} \in \Gamma_0\,]$

Note that these probabilities depend upon the statistical properties of the whole set of data used in the diffusion algorithm up to the current time $n$. In particular, the error probabilities depend upon the variations that the statistical distributions may have undergone during the evolution of the algorithm, and not only upon the particular hypothesis in force at time $n$.

Therefore, a rigorous analytical characterization of the system in terms of its overall inference performance at each time instant, and under general operation modalities (i.e., for arbitrarily varying statistical conditions), is generally not viable. This implies, among other difficulties, that the structure of the optimal test, or even of a reasonable one, is unknown. A standard approach in the adaptation literature to get useful performance metrics and meaningful insights consists of splitting the analysis into two parts:

  • A transient analysis where, starting from a given state, some variations in the statistical conditions occur and the time to track such variations is evaluated. It is possible to carry out studies that focus on the transient phase of the learning algorithm, and to clarify its behavior during this stage of operation, as is done in [38, 39].

  • A steady-state analysis, where the inference performance is evaluated with reference to an infinitely long period of stationarity. Even in the steady-state regime, an exact analytical characterization of the inference performance is seldom affordable. Therefore, closed-form results are usually obtained working in the regime of slow adaptation, i.e., of small step-sizes.

These two views are complementary. Typically, for a given value of the step-size $\mu$, the diffusion algorithm exhibits the following features:

  • The convergence towards the steady-state regime is known to occur at an exponential rate on the order of $r^n$, for some $r \in (0,1)$; this is a faster rate than that afforded, for example, by diminishing step-sizes. Nevertheless, in the constant step-size case, the smaller the value of $\mu$ is, the closer the value of $r$ gets to one.

  • The steady-state inference performance is a decreasing function of the step-size. Therefore, the lower $\mu$ is, the lower the steady-state error.

In this article, we address in some detail the steady-state performance of diffusion strategies for distributed detection over adaptive networks. Our main interest is in showing that the multi-agent network is able to learn well, with error probabilities exhibiting an exponential decay as functions of $1/\mu$. In particular, our analysis will be conducted with reference to the steady-state properties (as $n \to \infty$), and for small values of the step-size ($\mu \to 0$). Throughout the paper, the term steady-state will refer to the limit as the time-index $n$ goes to infinity, while the term asymptotic will be used to refer to the slow adaptation regime where $\mu \to 0$. Specifically, we will follow these steps:

  • We show that, in the stationary, steady-state regime, the diffusion output $y_{k,n}$ has a limiting distribution as $n$ goes to infinity (Theorem 1).

  • For small step-sizes, the steady-state distribution of the diffusion output approaches a Gaussian, i.e., it is asymptotically normal (Theorem 2).

  • We characterize the large deviations of the steady-state output in the slow adaptation regime where $\mu \to 0$ (Theorem 3).

  • The results of the above steps will provide a series of tools for designing the detector and characterizing its performance (Theorem 4).

2.2 Comparison with Decaying Step-Size Solutions

It is useful to contrast the above results with those pertaining to distributed detection algorithms with diminishing step-size [4, 5, 6]. The result in Theorem 1 reveals that, under stationary conditions, the detection statistic (i.e., the diffusion output $y_{k,n}$) converges to a limiting distribution, and the results in Theorem 2 add that such a limiting distribution is approximately Gaussian in the slow adaptation regime. In contrast, in the diminishing step-size case, the detection statistic collapses, as time elapses, into a deterministic value (e.g., the Kullback-Leibler divergence). Such convergence to a deterministic value reflects the continuously improving performance as time elapses with diminishing step-sizes. In particular, under stationary conditions, the error probabilities for diminishing step-size algorithms decay exponentially as functions of the time index $n$ — see, e.g., [4, 5, 6]. The latter feature must be contrasted with the results of our Theorems 3 and 4, where the exponential decay of the error probabilities does not refer to the time index $n$. Instead, we find the new result that the error probabilities decay exponentially as functions of the inverse of the step-size, $1/\mu$.

Finally, we would like to mention that the detailed statistical characterization offered by Theorems 1-3 is not confined to the specific detection problems we are dealing with. As a matter of fact, these results are of independent interest, and might be useful for the application of adaptive diffusion strategies in broader contexts.

2.3 Main Results

As explained in the previous section, we focus on a connected network of $S$ sensors, performing distributed detection by means of adaptive diffusion strategies. The adaptive nature of the solution allows the network to track variations in the hypotheses being tested over time. In order to enable continuous adaptation and learning, we shall employ distributed diffusion strategies with a constant step-size parameter $\mu$. Now, let $\alpha_k(\mu)$ and $\beta_k(\mu)$ represent the steady-state (as $n \to \infty$) Type-I and Type-II error probabilities at the $k$-th sensor. One of the main conclusions established in this paper can be summarized by the following scaling laws:

(6) $\alpha_k(\mu) \doteq e^{-(S/\mu)\,\Phi_0}, \qquad \beta_k(\mu) \doteq e^{-(S/\mu)\,\Phi_1}$

where the notation $\doteq$ means equality to the leading exponential order as $\mu$ goes to zero [40]. In the above expressions, the parameters $\Phi_0$ and $\Phi_1$ are solely dependent on the moment generating function of the single-sensor statistics $x_{k,n}$ and on the decision regions. These parameters are independent of the step-size $\mu$, the number of sensors $S$, and the network connectivity. Result (6) has at least four important and insightful ramifications about the performance of adaptive schemes for distributed detection over networks.

To begin with, Eq. (6) reveals a fundamental scaling law for distributed detection with diffusion adaptation, namely, it asserts that as the step-size decreases, the error probabilities are driven to zero exponentially as functions of $1/\mu$, and that the error exponents governing such a decay increase linearly in the number of sensors. These implications are even more revealing if examined in conjunction with the known results concerning the scaling law of the mean-square error (MSE) for adaptive distributed estimation over diffusion networks [20, 15]. Assuming a connected network with $S$ sensors, and using sufficiently small step-sizes $\mu$, the MSE that is attained by sensor $k$ obeys (see expression (32) in [15]):

(7) $\mathrm{MSE}_k \propto \dfrac{\mu}{S}$

where the symbol $\propto$ denotes proportionality. Some interesting symmetries are observed. In the estimation context, the MSE decreases as $\mu$ goes to zero, and the scaling rate improves linearly in the number of sensors. Recalling that smaller values of $\mu$ mean a lower degree of adaptation, we observe that reaching a better inference quality costs in terms of adaptation speed. This is a well-known trade-off in the adaptive estimation literature between tracking speed and estimation accuracy.

Second, we observe from (6) and (7) that the scaling laws governing errors of detection and estimation over distributed networks behave very differently, the former exhibiting an exponential decay proportional to $1/\mu$, while the latter is linear with decay proportional to $\mu$. The significance and elegance of this result for adaptive distributed networks lie in revealing an intriguing analogy with other more traditional inferential schemes. As a first example, consider the standard case of a centralized, non-adaptive inferential system with $n$ i.i.d. data points. It is known that the error probabilities of the best detector decay exponentially fast to zero with $n$, while the optimal estimation error decays as $1/n$ [41, 42]. Another important case is that of rate-constrained multi-terminal inference [43, 44]. In this case the detection performance scales exponentially with the bit-rate while, again, the squared estimation error vanishes as the inverse of the bit-rate. Thus, at an abstract level, reducing the step-size corresponds to increasing the number of independent observations in the first system, or increasing the bit-rate in the second system. The above comparisons furnish an interesting interpretation of the step-size as the basic parameter quantifying the cost of information used by the network for inference purposes, much as the number of data or the bit-rate in the considered examples.

A third aspect pertaining to the performance of the distributed network relates to the potential benefits of cooperation. These are already encoded into (6), and we have already implicitly commented on them. Indeed, note that the error exponents increase linearly in the number of sensors $S$. This implies that cooperation offers exponential gains in terms of detection performance.

The fourth and final ramification we would like to highlight relates to how much performance is lost by the distributed solution in comparison to a centralized stochastic gradient solution. Again, the answer is contained in (6). Specifically, the centralized solution is equivalent to a fully connected network, so that (6) applies to the centralized case as well. As already mentioned, the parameters $\Phi_0$ and $\Phi_1$ do not depend on the network connectivity, which therefore implies that, as the step-size decreases, the distributed diffusion solution of the inference problem exhibits a detection performance governed by the same error exponents as the centralized system. This is a remarkable conclusion, and it is also consistent with results in the context of adaptive distributed estimation over diffusion networks [15].

We now move on to describe the adaptive distributed solution and to establish result (6) and the aforementioned properties.

3 Existence of Steady-State Distribution

Let $y_n$ denote the vector that collects the state variables from across the network at time $n$, i.e.,

(8) $y_n \triangleq [\,y_{1,n}, y_{2,n}, \ldots, y_{S,n}\,]^T$

Likewise, we collect the local statistics at time $n$ into the vector $x_n \triangleq [\,x_{1,n}, x_{2,n}, \ldots, x_{S,n}\,]^T$. It is then straightforward to verify from the diffusion strategy (1)–(2) that the vector $y_n$ is given by:

(9) $y_n = (1-\mu)\,A\,y_{n-1} + \mu\,A\,x_n = (1-\mu)^n A^n y_0 + \mu \sum_{i=1}^{n} (1-\mu)^{n-i}\, A^{n-i+1}\, x_i$

We are concerned here with a steady-state analysis. Accordingly, we must examine the situation where the data are possibly nonstationary up to a certain time instant, after which they are drawn from the same stationary distribution for an infinitely long time. This implies that, when performing the steady-state analysis, it suffices to assume that the data, for all $n > 0$, arise from one and the same distribution. The past history (including possible drifts that occurred in the statistical conditions), which influences the overall evolution of the algorithm, is reflected in the initial state vector $y_0$. In addition, since, for $n > 0$, we only need to specify the particular distribution from which the data are drawn, in the forthcoming derivations we shall conduct our study with reference to a sequence of i.i.d. data with a given distribution. Later on, when applying the main findings to the detection problem, we shall use a subscript $h = 0, 1$ to denote that the data follow the distribution corresponding to a particular hypothesis.

We are now ready to show the existence and the specific shape of the limiting distribution. By making the change of variables $i \to n - i$, Eq. (9) can be written as

(10) $y_n = (1-\mu)^n A^n y_0 + \mu \sum_{i=0}^{n-1} (1-\mu)^{i}\, A^{i+1}\, x_{n-i}$

It follows that the state of the $k$-th sensor is given by:

(11) $y_{k,n} = (1-\mu)^n \sum_{l=1}^{S} [A^n]_{k,l}\; y_{l,0} + \mu \sum_{i=0}^{n-1} (1-\mu)^{i} \sum_{l=1}^{S} b_{k,l}(i)\; x_{l,n-i}$

where the scalars $b_{k,l}(i)$ are the entries of the matrix power:

(12) $b_{k,l}(i) \triangleq [A^{i+1}]_{k,l}$

Since we are interested in reaching a balanced fusion of the observations, we shall assume that $A$ is doubly stochastic with second largest eigenvalue magnitude strictly less than one, which yields [16, 8, 48]:

(13) $\lim_{n\to\infty} A^n = \dfrac{1}{S}\,\mathbf{1}\mathbf{1}^T$

Now, we notice that the first term on the RHS of (11) vanishes almost surely (a.s.) (and, hence, in probability [41]) as $n \to \infty$, since, for any initial state vector $y_0$, we have:

(14) $\lim_{n\to\infty}\, (1-\mu)^n \sum_{l=1}^{S} [A^n]_{k,l}\; y_{l,0} = 0 \quad \text{a.s.}$

Accordingly, if we are able to show that the second term on the RHS of (11) converges to a certain limiting distribution, we can then conclude that the variable $y_{k,n}$ converges as well to the same limiting distribution, as a direct application of Slutsky's Theorem [41].

In order to reveal the steady-state behavior of $y_{k,n}$, it suffices to focus on the last summation in (11). We observe preliminarily that the term $(1-\mu)^i$ in (10) depends on the time index in such a way that the most recent datum is assigned the highest scaling weight, in compliance with the adaptive nature of the algorithm. However, since the vectors $x_n$ are i.i.d. across time, and since we shall be only concerned with the distribution of partial sums involving these terms, the statistical properties of the summation in (10) are left unchanged if we replace $x_{n-i}$ with a random vector $x'_i$, where $\{x'_i\}$ is a sequence of i.i.d. random vectors distributed similarly to the $x_n$. Formally, as regards the steady-state term on the RHS of (11), we can write:

(15) $\mu \sum_{i=0}^{n-1} (1-\mu)^{i} \sum_{l=1}^{S} b_{k,l}(i)\; x_{l,n-i} \;\stackrel{d}{=}\; \sum_{i=0}^{n-1} z_i, \qquad z_i \triangleq \mu\,(1-\mu)^{i} \sum_{l=1}^{S} b_{k,l}(i)\; x'_{l,i}$

where $\stackrel{d}{=}$ denotes equality in distribution, and where the definition of $z_i$ should be clear. As a result, we are faced with a sum of independent, but not identically distributed, random variables. Let us evaluate the first two moments of the sum:

(16) $E\bigg[\sum_{i=0}^{n-1} z_i\bigg] = \mu \sum_{i=0}^{n-1} (1-\mu)^{i}\, E[x] \;\xrightarrow[n\to\infty]{}\; E[x]$

and

(17) $\mathrm{VAR}\bigg[\sum_{i=0}^{n-1} z_i\bigg] = \mu^2 \sum_{i=0}^{n-1} (1-\mu)^{2i} \sum_{l=1}^{S} b_{k,l}^2(i)\; \sigma_x^2$

where VAR denotes the variance operator, and $\sigma_x^2 \triangleq \mathrm{VAR}[x]$. We have thus shown that the expectation of the sum expression from (15) converges to $E[x]$, and that its variance converges to a finite value. In view of the Infinite Convolution Theorem — see [49, p. 266] — these two conditions are sufficient to conclude that the RHS of (15), i.e., the sum of the random variables $z_i$, converges in distribution as $n \to \infty$, and the first two moments of the limiting distribution are equal to the limits of (16) and (17). The random variable characterized by the limiting distribution will be denoted by $y_k^\star(\mu)$, where we make explicit the dependence upon the step-size for later use.

The above statement can be sharpened to ascertain that the sum of the random variables $z_i$ actually converges almost surely. This conclusion can be obtained by applying Kolmogorov's Two-Series Theorem [49]. In view of the a.s. convergence, it makes sense to define the limiting random variable as:

(18) $y_k^\star(\mu) \triangleq \sum_{i=0}^{\infty} z_i = \mu \sum_{i=0}^{\infty} (1-\mu)^{i} \sum_{l=1}^{S} b_{k,l}(i)\; x'_{l,i}$

We wish to avoid confusion here. We are not stating that the actual diffusion output $y_{k,n}$ converges almost surely (a behavior that would go against the adaptive nature of the diffusion algorithm). We are instead claiming that $y_{k,n}$ converges in distribution to a random variable $y_k^\star(\mu)$ that can be conveniently defined in terms of the a.s. limit (18).
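Because the series in (18) converges almost surely, approximate samples of the steady-state variable can be drawn by simply truncating it. The following Monte Carlo sketch does exactly that; the network, step-size, and statistic distribution are illustrative assumptions.

```python
import numpy as np

def sample_steady_state(A, draw_statistics, mu, n_samples, k=0, tol=1e-12):
    """Approximate samples of y_k*(mu) by truncating the series (18).

    The weight of term i is mu*(1-mu)^i * b_{k,l}(i), with b_{k,l}(i) = [A^(i+1)]_{k,l},
    so truncating once (1-mu)^i < tol is harmless.
    """
    n_terms = int(np.ceil(np.log(tol) / np.log(1.0 - mu)))
    rows = []
    P = A.copy()                          # holds A^(i+1), starting at i = 0
    for _ in range(n_terms):
        rows.append(P[k])                 # b_{k,l}(i): row k of A^(i+1)
        P = P @ A
    B = np.array(rows)                    # (n_terms, S)
    decay = mu * (1.0 - mu) ** np.arange(n_terms)
    samples = np.empty(n_samples)
    for j in range(n_samples):
        X = draw_statistics(n_terms)      # (n_terms, S) i.i.d. statistics x'_{l,i}
        samples[j] = np.sum(decay[:, None] * B * X)
    return samples

# Illustrative usage: fully averaged weights and Gaussian statistics with E[x] = 0.5.
S, mu = 10, 0.05
A = np.full((S, S), 1.0 / S)
rng = np.random.default_rng(1)
ys = sample_steady_state(A, lambda n: rng.normal(0.5, 1.0, (n, S)), mu, n_samples=2000)
print(ys.mean(), ys.var())                # mean near E[x]; variance shrinks with mu
```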

The main result about the steady-state behavior of the diffusion output is summarized below (the symbol $\rightsquigarrow$ means convergence in distribution).

Theorem 1: (Steady-state distribution of $y_{k,n}$). The state variable $y_{k,n}$ that is generated by the diffusion strategy (1)–(2) is asymptotically stable in distribution, namely,

(19) $y_{k,n} \;\rightsquigarrow\; y_k^\star(\mu), \quad \text{as } n \to \infty$

It is useful to make explicit the meaning of Theorem 1. By definition of convergence in distribution (or weak convergence), the result (19) can be formally stated as [50, 42]:

(20) $\lim_{n\to\infty} P[\,y_{k,n} \in \Gamma\,] = P[\,y_k^\star(\mu) \in \Gamma\,]$

for any (measurable) set $\Gamma$ such that $P[y_k^\star(\mu) \in \partial\Gamma] = 0$, where $\partial\Gamma$ denotes the boundary of $\Gamma$. It is thus seen that the properties of the steady-state variable $y_k^\star(\mu)$ will play a key role in determining the steady-state performance of the diffusion output. Accordingly, we state two useful properties of $y_k^\star(\mu)$.

First, when the local statistic $x$ has an absolutely continuous distribution (where the reference measure is the Lebesgue measure over the real line), it is easily verified that the distribution of $y_k^\star(\mu)$ is absolutely continuous as well. Indeed, note that we can write $y_k^\star(\mu) = z_0 + \sum_{i=1}^{\infty} z_i$. Now observe that $z_0$, which has an absolutely continuous distribution by assumption, is independent of the other term. The result follows by the properties of convolution and from the fact that the distribution of the sum of two independent variables is the convolution of their respective distributions.

Second, when the local statistic $x$ is a discrete random variable, by the Jessen–Wintner law [51, 52] we can only conclude that $y_k^\star(\mu)$ is of pure type, namely, its distribution is either absolutely continuous, or discrete, or continuous but singular.

An intriguing case is that of the so-called Bernoulli convolutions, i.e., random variables of the form $\sum_{i=0}^{\infty} r^i \epsilon_i$, where the $\epsilon_i$ are equiprobable $\pm 1$ and $0 < r < 1$. For this case, it is known that if $r < 1/2$, then the limiting distribution is a Cantor distribution [53]. This is an example of a distribution that is neither discrete nor absolutely continuous. When $r \in (1/2, 1)$, which is relevant for our discussion since we shall be concerned with small step-sizes (so that $1 - \mu$ is close to one), the situation is markedly different, and the distribution is absolutely continuous for almost all values of $r$.
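A quick numerical experiment makes this dichotomy visible (the values of $r$ and the truncation depth are arbitrary choices for illustration):

```python
import numpy as np

# Empirical look at Bernoulli convolutions sum_i r^i * eps_i, eps_i = +/-1 equiprobable.
rng = np.random.default_rng(2)
for r in (0.35, 0.7):                     # r < 1/2 vs. r > 1/2 (illustrative values)
    eps = rng.choice([-1.0, 1.0], size=(100_000, 60))
    y = eps @ (r ** np.arange(60))        # truncated series; 60 terms is plenty here
    hist, _ = np.histogram(y, bins=200)
    print(f"r = {r}: fraction of empty histogram bins = {np.mean(hist == 0):.2f}")
# r = 0.35 leaves many bins empty (the gaps of a singular, Cantor-like law), while
# r = 0.7 fills essentially all bins, consistent with an absolutely continuous limit.
```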

Before proceeding, we stress that we have proved that a steady-state distribution for $y_{k,n}$ exists, but its form is not known. Accordingly, even in steady-state, the structure of the optimal test is still unknown. In tackling this issue, and recalling that the regime of interest is that of slow adaptation, we now focus on the case $\mu \to 0$.

4 The Small-$\mu$ Regime

While the exact form of the steady-state distribution is generally impossible to evaluate, it is nevertheless possible to approximate it well for small values of the step-size parameter. Indeed, in this section we prove two results concerning the statistical characterization of the steady-state distribution as $\mu \to 0$. The first one is a result of asymptotic normality, stating that $y_k^\star(\mu)$ approaches a Gaussian random variable with known moments as $\mu$ goes to zero (Theorem 2). The second finding (Theorem 3) provides the complete characterization of the large deviations of $y_k^\star(\mu)$. In the following, $\mathcal{N}(m, \sigma^2)$ is a shortcut for a Gaussian distribution with mean $m$ and variance $\sigma^2$, and the symbol $\sim$ means "distributed as".

Theorem 2: (Asymptotic normality of $y_k^\star(\mu)$ as $\mu \to 0$). Under the assumption $\sigma_x^2 < \infty$, the variable $y_k^\star(\mu)$ fulfills, for all $k = 1, 2, \ldots, S$:

(21) $\dfrac{y_k^\star(\mu) - E[x]}{\sqrt{\mu}} \;\rightsquigarrow\; \mathcal{N}\bigg(0,\, \dfrac{\sigma_x^2}{2S}\bigg), \quad \text{as } \mu \to 0$

Proof: The argument requires dealing with independent but non-identically distributed random variables, as done in the Lindeberg-Feller CLT (Central Limit Theorem) [49]. This theorem, however, does not apply to our setting since the asymptotic parameter is not the number of samples, but rather the step-size. Some additional effort is needed, and the detailed technical derivation is deferred to Appendix A.  
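The theorem lends itself to a simple numerical check. For a fully connected network the weights $b_{k,l}(i)$ equal $1/S$ exactly, so the series (18) can be sampled directly; the sketch below compares the empirical variance with the value $\mu\sigma_x^2/(2S)$ appearing in the form of (21) given above (all numbers are illustrative).

```python
import numpy as np

# Monte Carlo check of Theorem 2 for a fully connected network, where
# b_{k,l}(i) = 1/S and y* = sum_i mu*(1-mu)^i * (mean of the S agents' statistics).
S, mean_x, var_x = 10, 0.5, 1.0
rng = np.random.default_rng(3)
for mu in (0.2, 0.05, 0.01):
    n_terms = int(np.log(1e-12) / np.log(1.0 - mu)) + 1
    decay = mu * (1.0 - mu) ** np.arange(n_terms)
    # The mean of S i.i.d. N(mean_x, var_x) statistics is N(mean_x, var_x / S).
    z = rng.normal(mean_x, np.sqrt(var_x / S), size=(4000, n_terms))
    ys = z @ decay
    print(f"mu={mu:5.2f}  empirical var={ys.var():.2e}  "
          f"mu*var_x/(2S)={mu * var_x / (2 * S):.2e}")
# The two columns agree better and better as mu decreases, and histograms of
# (ys - mean_x)/sqrt(mu) approach the Gaussian shape predicted by the theorem.
```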

4.1 Implications of Asymptotic Normality

Let us now briefly comment on several useful implications that follow from the above theorem:

  1. First, note that all sensors share, for $\mu$ small enough, the same distribution, namely, the inferential diffusion strategy equalizes the statistical behavior of the agents. This finding complements well the results from [20, 15, 34], where the asymptotic equivalence among the sensors has been proven in the context of mean-square-error estimation. One of the main differences between the estimation context and the detection context studied in this article is that in the latter case the regression data are deterministic and the randomness arises from the stochastic nature of the statistics $x_{k,n}$. For this reason, the steady-state distribution in (21) is characterized in terms of the moments of these statistics, and not in terms of the moments of regression data, as is the case in the estimation context.

  2. The result of Theorem 2 is valid provided that the connectivity matrix $A$ fulfills (13). This condition is satisfied when the network topology is strongly connected, i.e., there exists a path with nonzero weights connecting any two arbitrary nodes, and at least one node $k$ has $a_{k,k} > 0$ [16]. Obviously, condition (13) is also satisfied in the fully connected case when $a_{k,l} = 1/S$ for all $k$ and $l$. This latter situation would correspond to a representation of the centralized stochastic gradient algorithm, namely, an implementation of the form

    (22) $y_n^c = (1-\mu)\, y_{n-1}^c + \dfrac{\mu}{S} \sum_{l=1}^{S} x_{l,n}$

    where $y_n^c$ denotes the output of the centralized solution at time $n$. The above algorithm can be deduced from (1)–(2) by defining

    (23) $A = \dfrac{1}{S}\,\mathbf{1}\mathbf{1}^T$

    Now, since the moments of the limiting Gaussian distribution in (21) are independent of the particular connectivity matrix, the net effect is that each agent of the distributed network acts, asymptotically, as the centralized system. This result again complements well results in the estimation context, where the role of the statistics is played by stochastic regression data [54].

  3. The asymptotic normality result is powerful in approximating the steady-state distribution for relatively small step-sizes, thus enabling the analysis and design of inferential diffusion networks in many different contexts. With specific reference to the detection application that is the main focus here, Eq. (21) can be exploited for an accurate threshold setting when one desires to keep under control one of the two errors, say, the false-alarm probability, as happens, e.g., in the Neyman-Pearson setting [42]. To show a concrete example of how this can be done, let us assume that, without loss of generality, $E_0[x] < 0 < E_1[x]$, and consider a single-threshold detector for which:

    (24) decide $H_1$ if $y_{k,n} > \eta$, and $H_0$ otherwise,

    where the threshold $\eta$ is set as

    (25) $\eta = E_0[x] + \sqrt{\dfrac{\mu\,\sigma_0^2}{2S}}\; Q^{-1}(\bar{\alpha})$

    Here, $\sigma_0^2$ is the variance of $x$ under $H_0$, $Q$ denotes the complementary CDF of a standard normal distribution, and $\bar{\alpha}$ is the prescribed false-alarm level. By (21), it is straightforward to check that this threshold choice ensures (a numerical sketch is given right after this list):

    (26) $\lim_{\mu \to 0} \alpha_k(\mu) = \bar{\alpha}$
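A minimal sketch of this Neyman-Pearson threshold rule, following the form given in (25); the moments and the false-alarm level are placeholders:

```python
import numpy as np
from scipy.stats import norm

def np_threshold(mean0, var0, S, mu, alpha_bar):
    """Threshold (25): eta = E_0[x] + sqrt(mu*var_0/(2S)) * Q^{-1}(alpha_bar),
    which drives the steady-state false-alarm probability to alpha_bar (Eq. (26))."""
    return mean0 + np.sqrt(mu * var0 / (2.0 * S)) * norm.isf(alpha_bar)

# Illustrative numbers: E_0[x] = -1, sigma_0^2 = 2, S = 10, target false alarm 1%.
for mu in (0.2, 0.05, 0.01):
    print(mu, np_threshold(-1.0, 2.0, S=10, mu=mu, alpha_bar=0.01))
# The threshold sits slightly above E_0[x], and the gap shrinks as sqrt(mu).
```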

In summary, Theorem 2 provides an approximation of the diffusion output distribution for small step-sizes. At first glance, this may seem enough to obtain a complete characterization of the detection problem. A closer inspection reveals that this is not the case. A good example to understand why Theorem 2 alone is insufficient for characterizing the detection performance is obtained by examining the Neyman-Pearson threshold setting just described in (25)–(26) above. While we have seen that the asymptotic behavior of the false-alarm probability in (26) is completely determined by the application of Theorem 2, the situation is markedly different as regards the miss-detection probability $\beta_k(\mu)$. Indeed, by using (25) we can write:

(27) $\beta_k(\mu) = P_1[\,y_k^\star(\mu) \le \eta\,] \approx 1 - Q\bigg(\dfrac{\eta - E_1[x]}{\sqrt{\mu\,\sigma_1^2/(2S)}}\bigg)$

Since $E_0[x] < E_1[x]$, the argument of $Q(\cdot)$ diverges to $-\infty$ as $\mu \to 0$. As a consequence, the fact that $y_k^\star(\mu)$ is asymptotically normal does not provide much more insight than revealing that the miss-detection probability converges to zero as $\mu \to 0$. A meaningful asymptotic analysis would instead require examining the way this convergence takes place (i.e., the error exponent). The same kind of problem is found when one lets both error probabilities vanish exponentially, such that the Type-I and Type-II detection error exponents furnish a meaningful asymptotic characterization of the detector. In order to fill these gaps, the study of the large deviations of $y_k^\star(\mu)$ is needed.

4.2 Large Deviations of $y_k^\star(\mu)$

From (21) we learn that, as $\mu \to 0$, the diffusion output shrinks down to its limiting expectation $E[x]$, and that the small (of order $\sqrt{\mu}$) deviations around this value have a Gaussian shape. But this conclusion is not helpful when working with large deviations, namely, with terms like:

(28) $P[\,y_k^\star(\mu) \in \Gamma\,]$, with $\Gamma$ a region bounded away from $E[x]$,

which play a significant role in detection applications. While the above convergence to zero can be inferred from (21), it is well known that (21) is not sufficient in general to obtain the rate at which the above probability vanishes. In order to perform accurate design and characterization of reliable inference systems [55, 56], it is critical to assess this rate of convergence, which turns out to be the main purpose of a large deviations analysis.

Accordingly, we will be showing in the sequel that the family of random variables $y_k^\star(\mu)$ obeys a Large Deviations Principle (LDP), namely, that the following limit exists [55, 56]:

(29) $\lim_{\mu\to 0}\, \mu \ln P[\,y_k^\star(\mu) \in \Gamma\,] = -\inf_{\gamma\in\Gamma} I(\gamma)$

for some function $I(\gamma)$ that is called the rate function. Equivalently:

(30) $P[\,y_k^\star(\mu) \in \Gamma\,] = e^{-\frac{1}{\mu}\inf_{\gamma\in\Gamma} I(\gamma) + o(1/\mu)} \;\doteq\; e^{-\frac{1}{\mu}\inf_{\gamma\in\Gamma} I(\gamma)}$

where $o(1/\mu)$ stands for any correction term growing slower than $1/\mu$, namely, such that $\mu\, o(1/\mu) \to 0$ as $\mu \to 0$, and the notation $\doteq$ was introduced in (6). From (30) we see that, in the large deviations framework, only the dominant exponential term is retained, while any sub-exponential terms are discarded. It is also interesting to note that, according to (30), the probability that $y_k^\star(\mu)$ belongs to a given region $\Gamma$ is dominated by the infimum of the rate function within the region $\Gamma$. In other words, the smallest exponent (i.e., the highest probability) dominates, which is well explained in [56] through the statement: "any large deviation is done in the least unlikely of all the unlikely ways".

In summary, the LDP generally implies an exponential scaling law for probabilities, with an exponent governed by the rate function. Therefore, knowledge of the rate function is enough to characterize the exponent in (30). We shall determine the expression for $I(\gamma)$ pertinent to our problem in Theorem 3 further ahead — see Eq. (37).

In the traditional case where the statistic under consideration is the arithmetic average of i.i.d. data, the asymptotic parameter is the number of samples, and the usual tool for determining the rate function in the LDP is Cramér's Theorem [55, 56]. Unfortunately, in our adaptive and distributed setting, we are dealing with the more general statistic $y_k^\star(\mu)$, whose dependence is on the step-size parameter $\mu$ and not on the number of samples. Cramér's Theorem is not applicable in this case, and we must resort to a more powerful tool, known as the Gärtner-Ellis Theorem [55, 56], stated below in a form that uses directly the set of assumptions relevant for our purposes.

Gärtner-Ellis Theorem [56]. Let $z(\mu)$ be a family of random variables with Logarithmic Moment Generating Function (LMGF) $\Lambda_\mu(t) \triangleq \ln E[e^{t\, z(\mu)}]$. If

(31) $\psi(t) \triangleq \lim_{\mu\to 0}\, \mu\,\Lambda_\mu(t/\mu)$

exists, with $\psi(t) < \infty$ for all $t \in \mathbb{R}$, and is differentiable in $\mathbb{R}$, then $z(\mu)$ satisfies the LDP property (29) with rate function given by the Fenchel-Legendre transform of $\psi(t)$, namely:

(32) $\Psi(\gamma) \triangleq \sup_{t \in \mathbb{R}}\, [\,\gamma t - \psi(t)\,]$

In what follows, we shall use capital letters to denote Fenchel-Legendre transforms, as done in (32).
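Numerically, the Fenchel-Legendre transform (32) amounts to a one-dimensional maximization, which can be approximated by a grid search. A minimal sketch (the grid bounds are arbitrary and should cover the maximizer):

```python
import numpy as np

def fenchel_legendre(psi, gamma, t_grid=None):
    """Numeric Fenchel-Legendre transform (32): sup_t [gamma*t - psi(t)]."""
    if t_grid is None:
        t_grid = np.linspace(-20.0, 20.0, 40001)
    return np.max(gamma * t_grid - psi(t_grid))

# Sanity check: for psi(t) = t^2/2 (standard Gaussian LMGF) the transform is gamma^2/2.
psi = lambda t: 0.5 * t ** 2
for g in (0.0, 0.5, 2.0):
    print(g, fenchel_legendre(psi, g), g ** 2 / 2)
```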

We now show how this result allows us to assess the asymptotic performance of the diffusion output in the inferential network. Let us introduce the LMGF of the data $x$, and that of the steady-state variable $y_k^\star(\mu)$, respectively:

(33) $\varphi(t) \triangleq \ln E[e^{t x}]$
(34) $\Lambda_\mu(t) \triangleq \ln E[e^{t\, y_k^\star(\mu)}]$

Theorem 3: (Large deviations of $y_k^\star(\mu)$ as $\mu \to 0$). Assume that $\varphi(t) < \infty$ for all $t \in \mathbb{R}$. Then, for all $k = 1, 2, \ldots, S$:

  • (35) $\lim_{\mu\to 0}\, \mu\,\Lambda_\mu(t/\mu) = S\,\omega(t/S)$

    where

    (36) $\omega(t) \triangleq \displaystyle\int_0^1 \dfrac{\varphi(t\tau)}{\tau}\, d\tau$
  • The steady-state variable $y_k^\star(\mu)$ obeys the LDP with a rate function given by:

    (37) $I(\gamma) = S\,\Omega(\gamma)$

    that is, by the Fenchel-Legendre transform $\Omega(\gamma)$ of $\omega(t)$, multiplied by the number of sensors $S$.

Proof: See Appendix B.  
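The limit in the first part of the theorem can be verified numerically. For a single agent ($S = 1$), the steady-state variable is $\mu\sum_i (1-\mu)^i x'_i$, so its LMGF evaluated at $t/\mu$ gives $\mu\Lambda_\mu(t/\mu) = \mu\sum_i \varphi(t(1-\mu)^i)$; the sketch below (Gaussian statistic and the value of $t$ are illustrative) compares this sum against the integral form of $\omega(t)$ in (36).

```python
import numpy as np
from scipy.integrate import quad

# Check of (35)-(36) for S = 1: mu * sum_i phi(t*(1-mu)^i) --> int_0^1 phi(t*tau)/tau dtau.
phi = lambda t: 0.5 * t ** 2              # LMGF of a standard Gaussian statistic
t = 1.5
omega, _ = quad(lambda tau: phi(t * tau) / tau, 0.0, 1.0)   # integrand -> 0 as tau -> 0
for mu in (0.1, 0.01, 0.001):
    i = np.arange(int(np.log(1e-10) / np.log(1.0 - mu)) + 1)
    approx = mu * np.sum(phi(t * (1.0 - mu) ** i))
    print(f"mu={mu}: mu*Lambda(t/mu)={approx:.5f}   omega(t)={omega:.5f}")
# Here omega(t) = t^2/4 = 0.5625, and the finite-mu sums approach it roughly linearly in mu.
```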

4.3 Main Implications of Theorem 3

From Theorem 3, a number of interesting conclusions can be drawn:

  • The function $\omega(t)$ in (36) depends only upon the LMGF of the original statistic $x$, and does not depend on the number of sensors.

  • As a consequence of the above observation, the second part of the theorem implies that the rate function (and, therefore, the large deviations exponent) of the diffusion output depends linearly on the number of sensors $S$. Moreover, the rate can be determined by knowing only the statistical distribution of the input data $x$.

  • The rate function does not depend on the particular sensor $k$. This implies that all sensors are asymptotically equivalent also in terms of large deviations, thus strengthening what we have already found in terms of asymptotic normality — see Theorem 2 and the subsequent discussion.

  • Theorem 3 can be applied to the centralized stochastic algorithm (22) as well, and, again, the diffusion strategy is able to match, asymptotically, the centralized solution.

Before ending this section, it is useful to comment on some essential features of the functions $\omega(t)$ and $\Omega(\gamma)$, which will provide insights on their usage in connection with the distributed detection problem. To this aim, we refer to the following convexity properties shown in Appendix C (see also [55], Ex. 2.2.24, and [56], Ex. I.16):

  • $\omega''(t) > 0$ for all $t \in \mathbb{R}$, implying that $\omega(t)$ is strictly convex.

  • $\Omega(\gamma)$ is strictly convex in the interior of the set:

    (38) $\{\gamma \in \mathbb{R} : \gamma = \omega'(t) \ \text{for some } t \in \mathbb{R}\}$
  • $\Omega(\gamma)$ attains its unique minimum at $\gamma = E[x]$, with

    (39) $\Omega(E[x]) = 0$

In light of these properties, it is possible to provide a geometric interpretation of the main quantities in Theorem 3, as illustrated in Fig. 2. The leftmost panel shows a typical behavior of the LMGF $\varphi(t)$ of the original data $x$. Using the definition (36), and examining the sign of $\varphi(t)$, it is possible to deduce the corresponding typical behavior of $\omega(t)$, depicted in the middle panel. As can be seen, the slope at the origin is preserved, and is still equal to the expectation of the original data, $E[x]$. The intersection with the $t$-axis changes, and moves further to the right in the considered example. Starting from $\omega(t)$, it is possible to draw a sketch of its Fenchel-Legendre transform $\Omega(\gamma)$ (rightmost panel), which illustrates its convexity properties, and the fact that the minimum value of zero is attained only at $\gamma = E[x]$.

Figure 2: Leftmost panel: the LMGF $\varphi(t)$ of the original data $x$; its slope at the origin is $E[x]$. Middle panel: the function $\omega(t)$ defined by (36) is strictly convex; its slope at the origin is also equal to $E[x]$. The labels underneath the plot illustrate the intervals over which $\omega(t)$ is negative and positive for the LMGF shown in the leftmost plot. Rightmost panel: the Fenchel-Legendre transform $\Omega(\gamma)$, which is relevant for the evaluation of the rate function, attains its minimum value of zero at $\gamma = E[x]$.

5 The Distributed Detection Problem

The tools and results developed so far allow us to address in some detail the detection problem we are interested in. Let us denote the decision regions in favor of $H_0$ and $H_1$ by $\Gamma_0$ and $\Gamma_1$, respectively. We assume that they are the same at all sensors because, in view of the asymptotic equivalence among sensors proved in the previous section, there is no particular interest in making a different choice. Note, however, that the subsequent development does not rely on this assumption and applies, mutatis mutandis, to the case of distinct decision regions used by distinct agents.

The Type-I and Type-II error probabilities at the $k$-th sensor at time $n$ are defined in (4) and (5), respectively. Since we are interested in their steady-state behavior, namely, for an increasingly large interval during which a certain hypothesis stays in force, the only distribution that matters is the one corresponding to that hypothesis. Therefore, it is legitimate to write:

(40) $\alpha_k \triangleq \lim_{n\to\infty} \alpha_{k,n} = \lim_{n\to\infty} P_0[\,y_{k,n} \in \Gamma_1\,]$
(41) $\beta_k \triangleq \lim_{n\to\infty} \beta_{k,n} = \lim_{n\to\infty} P_1[\,y_{k,n} \in \Gamma_0\,]$

where the subscripts $0$ and $1$ denote here the (stationary) situation where the data collected for all $n > 0$ come from one and the same distribution. As already observed, this simply corresponds to saying that the stationarity period used to compute the steady-state distribution starts at time $n = 0$. Some questions arise. Do these limits exist? Do these probabilities vanish as $n$ approaches infinity? Theorem 1 provides the answers. Indeed, we found that $y_{k,n}$ stabilizes in distribution as $n$ goes to infinity. In the sequel, in order to avoid dealing with pathological cases, we shall assume that $P_0[y_k^\star(\mu) \in \partial\Gamma_1] = 0$ and that $P_1[y_k^\star(\mu) \in \partial\Gamma_0] = 0$. This is a mild assumption, which is verified, for instance, when the limiting random variable $y_k^\star(\mu)$ has an absolutely continuous distribution, and the decision regions are not so convoluted as to have boundaries with strictly positive measure. Accordingly, by invoking the weak convergence result of Theorem 1, and in view of (20), we can write:

(42) $\alpha_k(\mu) = P_0[\,y_k^\star(\mu) \in \Gamma_1\,]$
(43) $\beta_k(\mu) = P_1[\,y_k^\star(\mu) \in \Gamma_0\,]$

where the dependence upon $\mu$ has been made explicit for later use. We notice that, in the above, we work with decision regions that do not depend on $n$, which corresponds exactly to the setup of Theorem 1. Generalizations where the regions are allowed to change with $n$ can be handled by resorting to known results from asymptotic statistics. To give an example, consider the meaningful case of a detector with a sequence of thresholds $\eta_n$ that converges to a value $\eta$ as $n \to \infty$. Here,

(44) $\lim_{n\to\infty} P[\,y_{k,n} > \eta_n\,] = P[\,y_k^\star(\mu) > \eta\,]$

which can be seen, e.g., as an application of Slutsky's Theorem [41, 42].

From (42)–(43), it turns out that, as time elapses, the error probabilities do not vanish exponentially. As a matter of fact, they do not vanish at all. This situation is in contrast to what happens in the case of running consensus strategies with diminishing step-size studied in the literature [1, 2, 3, 4, 5, 6]. We wish to avoid confusion here. In the diminishing step-size case, one does need to examine the effect of large deviations [4, 5, 6] for large $n$, quantifying the rate of decay to zero of the error probabilities as time progresses. In the adaptive context, on the other hand, where constant step-sizes are used to enable continuous adaptation and learning, the large deviations analysis is totally different, in that it is aimed at characterizing the decay rate of the error probabilities as the step-size $\mu$ approaches zero.

Returning to the detection performance evaluation (42)–(43), we stress that the steady-state values of these error probabilities are unknown, since the distribution of $y_k^\star(\mu)$ is generally unknown. However, the large deviations result offered by Theorem 3 allows us to characterize the error exponents in the regime of small step-sizes.

Theorem 3 can be tailored to our detection setup as follows (subscripts $0$ and $1$ are used to indicate that the statistical quantities are evaluated under $H_0$ and $H_1$, respectively):

Theorem 4: (Detection error exponents). For $h = 0, 1$, let $\Gamma_0$ and $\Gamma_1$ be the decision regions (independent of $\mu$), assume that $\varphi_h(t) < \infty$ for all $t \in \mathbb{R}$, and define:

(45) $\omega_h(t) \triangleq \displaystyle\int_0^1 \dfrac{\varphi_h(t\tau)}{\tau}\, d\tau$

Then, for all $k = 1, 2, \ldots, S$, Eq. (6) holds true, namely,

(46) $\alpha_k(\mu) \doteq e^{-(S/\mu)\,\Phi_0}, \qquad \beta_k(\mu) \doteq e^{-(S/\mu)\,\Phi_1}$

with

(47) $\Phi_0 \triangleq \inf_{\gamma \in \Gamma_1} \Omega_0(\gamma), \qquad \Phi_1 \triangleq \inf_{\gamma \in \Gamma_0} \Omega_1(\gamma)$

where $\Omega_h(\gamma)$ is the Fenchel-Legendre transform of $\omega_h(t)$.

Remark I. The technical requirement that the LMGFs $\varphi_0(t)$ and $\varphi_1(t)$ are finite is met in many practical detection problems, as already shown in [5]. In particular, the assumption is clearly verified when the observations have compact support (the same under the two hypotheses), a special interesting case being that of discrete variables supported on a finite alphabet; and for shift-in-mean detection problems where the data distributions fulfill mild regularity conditions — see Remark II in [5] for a detailed list.

Figure 3: A geometric view of Theorem 4.

Remark II. As is typical in large deviations analysis, we have worked with regions $\Gamma_0$ and $\Gamma_1$ that do not depend on the step-size $\mu$. Generalizations are possible to the case in which these regions depend on $\mu$. A relevant case where this might be useful is the Neyman-Pearson setup, where one needs to work with a fixed (non-vanishing) value of the false-alarm probability. An example of this scenario is provided in Sec. 6.3 — see the discussion following (78) — along with the detailed procedure for the required generalization.

In Fig. 3, we provide a geometric interpretation that can be useful to visualize the main message conveyed by Theorem 4. In order to rule out trivial cases, we assume that $E_0[x] < 0 < E_1[x]$, as happens, e.g., in the standard situation where the local statistic is a log-likelihood ratio and the detection problem is identifiable [42]. Without loss of generality, and for the sake of concreteness, we consider a detector with threshold $\eta$, amounting to the following form for the decision regions:

(48) $\Gamma_0 = \{\gamma \in \mathbb{R} : \gamma \le \eta\}, \qquad \Gamma_1 = \{\gamma \in \mathbb{R} : \gamma > \eta\}$

Let us set $\eta \in (E_0[x], E_1[x])$ since, as will become clear soon, choosing a threshold outside this range will lead to trivial performance for one of the error exponents. According to Theorem 4, to evaluate the exponent $\Phi_0$ (resp., $\Phi_1$), one must consider the worst case, i.e., the smallest value of the function $\Omega_0(\gamma)$ (resp., $\Omega_1(\gamma)$), within the corresponding error region $\Gamma_1$ (resp., $\Gamma_0$). In view of the convexity properties discussed at the end of Sec. 4.3, and reported in Appendix C, we see that, for the threshold detector, both minima are attained only at $\gamma = \eta$. This structure turns out to be of great interest in practical applications where, inspired by the optimality properties of a log-likelihood ratio test in the centralized case, a threshold detector is often an appealing and reasonable choice. On the other hand, we would like to stress that different, arbitrary decision regions can in general be chosen, and that the minima of $\Omega_0$ and $\Omega_1$ in Fig. 3 might be correspondingly located at two different points.

In summary, Theorem 4 allows us to compute the exponents $\Phi_0$ and $\Phi_1$ as functions of: the kind of statistic employed by the sensors, which determines the shape of the LMGFs to be used in (45); and the employed decision regions, relevant for the minimizations in (47). Once $\Phi_0$ and $\Phi_1$ have been found, the error probabilities $\alpha_k(\mu)$ and $\beta_k(\mu)$ can be approximated using Eq. (6). This result is then key for both detector design and analysis (a numerical sketch of the whole recipe is given below), and we are now ready to illustrate the operation of the adaptive distributed network of detectors.
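As a worked illustration of this recipe (anticipating the Gaussian shift-in-mean example of Sec. 6.1 as a test case), the following sketch computes $\Phi_0$ and $\Phi_1$ for a single-threshold detector from the LMGFs via (45) and (47), and turns them into leading-order error-probability estimates via (6). The numbers are illustrative.

```python
import numpy as np
from scipy.integrate import quad

def omega(phi, t):
    """omega_h(t) per (45): int_0^1 phi_h(t*tau)/tau dtau."""
    return quad(lambda tau: phi(t * tau) / tau, 0.0, 1.0)[0]

def fl_transform(phi, gamma, t_grid=np.linspace(-30.0, 30.0, 1201)):
    """Fenchel-Legendre transform of omega_h at gamma, by grid search."""
    return max(gamma * t - omega(phi, t) for t in t_grid)

# Gaussian shift-in-mean LLR statistics (Sec. 6.1) with KL divergence D = 0.5.
D, eta, S, mu = 0.5, 0.0, 10, 0.05
phi0 = lambda t: D * t * (t - 1.0)       # LMGF of the statistic under H0
phi1 = lambda t: D * t * (t + 1.0)       # ... and under H1
# For the threshold detector, both infima in (47) are attained at gamma = eta.
Phi0 = fl_transform(phi0, eta)
Phi1 = fl_transform(phi1, eta)
print(Phi0, (eta + D) ** 2 / (2 * D))    # numeric vs. closed form: both 0.25 here
print(np.exp(-(S / mu) * Phi0))          # leading-order false-alarm estimate via (6)
```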

6 Examples of Application

In this section, we apply the developed theory to four relevant detection problems. We start with the classical Gaussian shift-in-mean problem. Then, we consider a scenario of specific relevance for sensor network applications, namely, detection with hardly (one-bit) quantized measurements. This case amounts to testing two Bernoulli distributions with different parameters under the different hypotheses. Both the Gaussian and the finite-alphabet assumptions are removed in the subsequent example, where a problem of relevance to radar applications is addressed, that is, shift-in-mean with additive noise sampled from a Laplace (double-exponential) distribution. Finally, we examine a case where the agents have limited knowledge of the underlying data model, and agree to employ a simple sample-mean detector, in the presence of noise distributed as a Gaussian mixture.

Figure 4: Network skeleton used for the numerical simulations.

Before dwelling on the presentation of the numerical experiments, we provide some essential details on the strategy that has been implemented for obtaining them:

  • The network used for our experiments consists of $S = 10$ sensors, arranged so as to form the topology in Fig. 4, with combination weights chosen according to the Laplacian rule [8, 16].

  • The decision rule for the detectors is based on comparing the diffusion output $y_{k,n}$ to some threshold $\eta$, namely,

    (49) decide $H_1$ if $y_{k,n} > \eta$, and $H_0$ otherwise,

    where the decision regions are the same as in (48).

  • Selecting the threshold in (49) is a critical stage of detector design and implementation. This choice can be guided by different criteria, which would lead to different threshold settings. In the following examples, we present three relevant cases, namely: a threshold setting that is suited to the Bayesian and the max-min criteria (Sec. 6.2); a Neyman-Pearson threshold setting (Sec. 6.3); and a threshold setting in the presence of insufficient information about the underlying statistical models (Sec. 6.4). We would like to stress that using different threshold setting rules for different statistical models has no particular meaning. These choices are just meant to illustrate different rules and different models while avoiding repetition of similar results.

  • The diffusion output is obtained after the consultation steps involving the exchange of the local statistics $x_{k,n}$. The particular kind of statistic used in the different examples will be detailed when needed.

6.1 Shift-in-mean Gaussian Problem

The first hypothesis testing problem we consider is the following:

(50) $H_0:\ d_{k,n} \sim \mathcal{N}(0, \sigma^2)$
(51) $H_1:\ d_{k,n} \sim \mathcal{N}(\theta, \sigma^2)$

where $d_{k,n}$ denotes the local datum collected by sensor $k$ at time $n$, $\theta > 0$ is the mean shift, and $\sigma^2$ is the noise variance. We assume the local statistic to be shared during the diffusion process is the log-likelihood ratio of the measurement $d_{k,n}$:

(52) $x_{k,n} = \ln \dfrac{f_1(d_{k,n})}{f_0(d_{k,n})} = \dfrac{\theta}{\sigma^2}\Big(d_{k,n} - \dfrac{\theta}{2}\Big)$

Note that in the Gaussian case the log-likelihood ratio is simply a shifted and scaled version of the collected observation $d_{k,n}$, such that no substantial differences are expected if the agents share the observations directly.

In the specific case that $x_{k,n}$ is the log-likelihood ratio, the expectations $E_0[x]$ and $E_1[x]$ assume a peculiar meaning. Indeed, they can be conveniently represented as:

(53) $E_1[x] = D(f_1 \| f_0), \qquad E_0[x] = -D(f_0 \| f_1)$

where $D(f_h \| f_{1-h})$, with $h = 0, 1$, is the Kullback-Leibler (KL) divergence between hypotheses $H_h$ and $H_{1-h}$ — see [40]. In particular, for the Gaussian shift-in-mean problem the two divergences coincide, and the distribution of the log-likelihood ratio can be expressed in terms of the KL divergence as follows:

(54) $x_{k,n} \sim \mathcal{N}(-D, 2D) \ \text{under } H_0, \qquad x_{k,n} \sim \mathcal{N}(D, 2D) \ \text{under } H_1$

where

(55) $D = \dfrac{\theta^2}{2\sigma^2}$

is the KL divergence for the Gaussian shift-in-mean case [40].

Since the LMGF of a Gaussian random variable $\mathcal{N}(m, s^2)$ is $m t + s^2 t^2/2$ [42], we deduce from (54) that

(56) $\varphi_0(t) = D\,t\,(t - 1), \qquad \varphi_1(t) = D\,t\,(t + 1)$

Note that $\varphi_1(t) = \varphi_0(t + 1)$, a relationship that holds true more generally when working with the LMGFs of the log-likelihood ratio — see, e.g., [55]. Now, applying (45) to (56) readily gives

(57) $\omega_0(t) = D\Big(\dfrac{t^2}{2} - t\Big), \qquad \omega_1(t) = D\Big(\dfrac{t^2}{2} + t\Big)$

According to its definition (32), in order to find the Fenchel-Legendre transform we should maximize, with respect to $t$, the function $\gamma t - \omega_h(t)$. In view of the convexity properties proved in Appendix C, this can be done by taking the first derivative and equating it to zero, which is equivalent to writing

(58) $\gamma = D\,(t - 1), \quad \text{under } H_0$
(59) $\gamma = D\,(t + 1), \quad \text{under } H_1$

These expressions lead to

(60) $\Omega_0(\gamma) = \dfrac{(\gamma + D)^2}{2D}, \qquad \Omega_1(\gamma) = \dfrac{(\gamma - D)^2}{2D}$

Selecting the threshold within the interval $\eta \in (-D, D)$, the minimization in (47) is easily performed — refer to Fig. 3 and the related discussion. The final result is:

(61) $\Phi_0 = \dfrac{(\eta + D)^2}{2D}, \qquad \Phi_1 = \dfrac{(\eta - D)^2}{2D}$

These expressions provide the complete asymptotic characterization, to the leading exponential order (i.e., they furnish the detection error exponents), of the adaptive distributed network of detectors for the Gaussian shift-in-mean problem, and for any choice of the threshold $\eta$ within the interval $(-D, D)$.

We have run a number of numerical simulations to check the validity of the results. Clearly, in order to show the generality of our methods, it is desirable to test them on non-Gaussian data as well. Since the interpretation of the results for Gaussian and non-Gaussian data is essentially similar, we shall skip the numerical results for the Gaussian case to avoid unnecessary repetition and focus on the other cases. Accordingly, the discussion on how to make a careful selection of the detection threshold is also postponed to the forthcoming sections.

6.2 Hardly (one-bit) Quantized Measurements

We now examine the example in which the measurements at the local sensors are hardly (one-bit) quantized. This situation can be formalized as the following hypothesis test:

(62) $H_0:\ d_{k,n} \sim \mathcal{B}(p_0)$
(63) $H_1:\ d_{k,n} \sim \mathcal{B}(p_1)$

with $\mathcal{B}(p)$ denoting a Bernoulli random variable with success probability $p$. As in the previous example, we assume that the local statistics employed by the sensors in the adaptation/combination stages are chosen as the local log-likelihood ratios that, in view of (62)–(63), can be written as:

(64) $x_{k,n} = d_{k,n}\, \ln\dfrac{p_1}{p_0} + (1 - d_{k,n})\, \ln\dfrac{1 - p_1}{1 - p_0}$

where $d_{k,n} \in \{0, 1\}$, with $0 < p_0 < p_1 < 1$. Since $p_1 > p_0$, we see that $x_{k,n}$ is a binary random variable taking on the values $\ln\frac{p_1}{p_0} > 0$ or $\ln\frac{1 - p_1}{1 - p_0} < 0$. The distribution of $x_{k,n}$ is then characterized by:

(65) $P_h\Big[x_{k,n} = \ln\dfrac{p_1}{p_0}\Big] = p_h, \qquad P_h\Big[x_{k,n} = \ln\dfrac{1 - p_1}{1 - p_0}\Big] = 1 - p_h, \qquad h = 0, 1,$

and, hence, the LMGFs for this example are readily computed:

(66) $\varphi_0(t) = \ln\Big[\,p_0\Big(\dfrac{p_1}{p_0}\Big)^{t} + (1 - p_0)\Big(\dfrac{1 - p_1}{1 - p_0}\Big)^{t}\,\Big]$
(67) $\varphi_1(t) = \ln\Big[\,p_1\Big(\dfrac{p_1}{p_0}\Big)^{t} + (1 - p_1)\Big(\dfrac{1 - p_1}{1 - p_0}\Big)^{t}\,\Big]$

According to the relationship (