# Large Deviations Performance of Consensus+Innovations Distributed Detection with Non-Gaussian Observations

Dragana Bajović, Dušan Jakovetić, José M. F. Moura, João Xavier, and Bruno Sinopoli

The work of the first, second, fourth, and fifth authors is partially supported by grants CMU-PT/SIA/0026/2009 and SFRH/BD/33517/2008 (through the Carnegie Mellon/Portugal Program managed by ICTI), by grant PTDC/EEA-CRO/104243/2008 from Fundação para a Ciência e Tecnologia, and by ISR/IST plurianual funding (POSC program, FEDER). The work of the third author is partially supported by NSF under grants CCF-1011903 and CCF-1018509, and by AFOSR grant FA95501010291. Dragana Bajović and Dušan Jakovetić hold fellowships from the Carnegie Mellon/Portugal Program.

Dragana Bajović and Dušan Jakovetić are with the Institute for Systems and Robotics (ISR), Instituto Superior Técnico (IST), Technical University of Lisbon, Lisbon, Portugal, and with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA, dbajovic@andrew.cmu.edu, djakovet@andrew.cmu.edu.

José M. F. Moura and Bruno Sinopoli are with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA, moura@ece.cmu.edu, brunos@ece.cmu.edu.

João Xavier is with the Institute for Systems and Robotics (ISR), Instituto Superior Técnico (IST), Technical University of Lisbon, Lisbon, Portugal, jxavier@isr.ist.utl.pt.

###### Abstract

We establish the large deviations asymptotic performance (error exponent) of consensus+innovations distributed detection over random networks with generic (non-Gaussian) sensor observations. At each time instant, sensors 1) combine their own decision variables with those of their neighbors (consensus) and 2) assimilate their new observations (innovations). This paper shows, for general non-Gaussian distributions, that consensus+innovations distributed detection exhibits a phase transition behavior with respect to the network degree of connectivity. Above a threshold, distributed detection is as good as centralized detection, with the same optimal asymptotic detection performance; below the threshold, distributed detection is suboptimal with respect to centralized detection. We determine this threshold and quantify the performance loss below the threshold. Finally, we show the dependence of the threshold and performance on the distribution of the observations: distributed detectors over the same random network, but with different observation distributions, for example, Gaussian, Laplace, or quantized, may have different asymptotic performance, even when the corresponding centralized detectors have the same asymptotic performance.

Keywords: Consensus+innovations, performance analysis, Chernoff information, non-Gaussian distributions, distributed detection, random network, information flow, large deviations.

## I Introduction

Consider a distributed detection scenario where $N$ sensors are connected by a generic network with intermittently failing links. The sensors perform consensus+innovations distributed detection; in other words, at each time $k$, each sensor $i$ updates its local decision variable by: 1) sensing and processing a new measurement to create an intermediate variable; and 2) computing a weighted average of this variable with its neighbors’ intermediate decision variables. We showed in [1] that, when the sensor observations are Gaussian, the consensus+innovations distributed detector exhibits a phase transition. When the network connectivity is above a threshold, the distributed detector is asymptotically optimal, i.e., asymptotically equivalent to the optimal centralized detector that collects the observations of all sensors.

This paper establishes the asymptotic performance of distributed detection over random networks for generic, non-Gaussian sensor observations. We adopt as asymptotic performance measure the exponential decay rate of the Bayes error probability (error exponent). We show that phase transition behavior emerges with non-Gaussian observations and demonstrate how the optimality threshold is a function of the log-moment generating function of the sensors’ observations and of the number of sensors $N$. This reveals an interplay between the distribution of the sensor observations (e.g., Gaussian or Laplace) and the rate of diffusion (or connectivity) of the network (measured by a parameter $|\log r|$ defined in Section II): for a network with the same connectivity, a distributed detector with, say, Laplace observations may match the optimal asymptotic performance of the centralized detector, while the distributed detector for Gaussian observations may be suboptimal, even though the centralized detectors for the two distributions, Laplace and Gaussian, have the same optimal asymptotic performance.

For distributed detection, we determine the range of the detection threshold $\gamma$ for which each sensor achieves exponentially fast decay of the error probability (strictly positive error exponent), and we find the optimal $\gamma$ that maximizes the error exponent. Interestingly, above the critical (phase transition) value for the network connectivity $|\log r|$, the optimal detector threshold is $\gamma^\star=0$, mimicking the (asymptotically) optimal threshold for the centralized detector. However, below the critical connectivity, we show by a numerical example that the optimal distributed detector threshold may be nonzero.

Brief review of the literature. Distributed detection has been extensively studied, in the context of parallel fusion architectures, e.g., [2, 3, 4, 5, 6, 7, 8], consensus-based detection [9, 10, 11, 12], and, more recently, consensus+innovations distributed inference, see, e.g., [13, 14, 15, 16, 17] for distributed estimation, and [18, 19, 20, 21, 22, 23, 24] for distributed detection. Different variants of consensus+innovations distributed detection algorithms have been proposed; we analyze here running consensus, the variant in [20].

Reference [20] considers asymptotic optimality of running consensus, but in a framework that is very different from ours. Reference [20] studies the asymptotic performance of the distributed detector where the means of the sensor observations under the two hypotheses become closer and closer (vanishing signal to noise ratio (SNR)), at the rate of $1/\sqrt{k}$, where $k$ is the number of observations. For this problem, there is an asymptotic, non-zero, probability of miss and an asymptotic, non-zero, probability of false alarm. Under these conditions, running consensus is as efficient as the optimal centralized detector [25], as long as the network is connected on average. Here, we assume that the means of the distributions stay fixed as $k$ grows. We establish, through large deviations, the rate (error exponent) at which the error probability decays to zero as $k$ goes to infinity. We show that connectedness on average is not sufficient for running consensus to achieve the optimality of centralized detection; rather, a phase change occurs, with distributed detection becoming as good as centralized detection when the network connectivity, measured by $|\log r|$, exceeds a certain threshold.

We distinguish this paper from our prior work on the performance analysis of running consensus. In [26], we studied deterministically time varying networks and Gaussian observations, and in [27], we considered a different consensus+innovations detector with Gaussian observations and additive communication noise. Here, we consider random networks, non-Gaussian observations, and noiseless communications. Reference [1] considers random networks and Gaussian, spatially correlated observations. In contrast, here the observations are non-Gaussian and spatially independent. We proved our results in [1] by using the quadratic nature of the Gaussian log-moment generating function. For general non-Gaussian observations, the log-moment generating function is no longer quadratic, and the arguments in [1] no longer apply; we develop a more general methodology that establishes the optimality threshold in terms of the log-moment generating function of the log-likelihood ratio. We derive our results from generic properties of the log-moment generating function, such as convexity and zero value at the origin. Finally, while reference [1] and our other prior work considered the zero detection threshold $\gamma=0$, here we extend the results to generic detection thresholds $\gamma$. Our analysis reveals that, when $|\log r|$ is above its critical value, the zero detector threshold is (asymptotically) optimal. When $|\log r|$ is below the critical value, we compute the best detector threshold $\gamma^\star$, which may be non-zero in general.

Our analysis shows the impact of the distribution of the sensor observations on the performance of distributed detection: distributed detectors (with different distributions of the sensors observations) can have different asymptotic performance, even though the corresponding centralized detectors are equivalent, as we will illustrate in detail in Section IV.

Paper outline. Section II introduces the network and sensor observations models and presents the consensus+innovations distributed detector. Section III presents and proves our main results on the asymptotic performance of the distributed detector. For a cleaner exposition, this section proves the results for (spatially) identically distributed sensor observations. Section IV illustrates our results on several types of sensor observation distributions, namely, Gaussian, Laplace, and discrete valued distributions, discussing the impact of these distributions on distributed detection performance. Section V extends our main results to non-identically distributed sensors’ observations. Finally, Section VI concludes the paper.

Notation. We denote by: $A_{ij}$ the $(i,j)$-th entry of a matrix $A$; $a_i$ the $i$-th entry of a vector $a$; $I$, $1$, and $e_i$, respectively, the identity matrix, the column vector with unit entries, and the $i$-th column of $I$; $J$ the $N\times N$ ideal consensus matrix $J:=(1/N)11^\top$; $\|\cdot\|_l$ the vector (respectively, matrix) $l$-norm of its vector (respectively, matrix) argument; $\|\cdot\|=\|\cdot\|_2$ the Euclidean (respectively, spectral) norm of its vector (respectively, matrix) argument; $\mu_i(\cdot)$ the $i$-th largest eigenvalue; $E[\cdot]$ and $P(\cdot)$ the expected value and probability operators, respectively; $I_{A}$ the indicator function of the event $A$; $\nu^N$ the product measure of $N$ i.i.d. observations drawn from the distribution with measure $\nu$; $h'(z)$ and $h''(z)$ the first and the second derivatives of the function $h$ at point $z$.

## II Problem formulation

This section introduces the sensor observations model, reviews the optimal centralized detector, and presents the consensus+innovations distributed detector. The section also reviews relevant properties of the log-moment generating function of a sensor’s log-likelihood ratio that are needed in the sequel.

### II-A Sensor observations model

We study the binary hypothesis testing problem $H_1$ versus $H_0$. We consider a network of $N$ nodes, where $Y_i(t)$ is the observation of sensor $i$ at time $t$, with $i=1,\ldots,N$ and $t=1,2,\ldots$

###### Assumption 1

The sensors’ observations $\{Y_i(t)\}$ are independent and identically distributed (i.i.d.) both in time and in space, with distribution $\nu_1$ under hypothesis $H_1$ and $\nu_0$ under $H_0$:

$$Y_i(t)\sim\begin{cases}\nu_1, & H_1\\ \nu_0, & H_0\end{cases},\qquad i=1,\ldots,N,\quad t=1,2,\ldots \tag{1}$$

The distributions $\nu_1$ and $\nu_0$ are mutually absolutely continuous, distinguishable measures. The prior probabilities $\pi_1=P(H_1)$ and $\pi_0=P(H_0)=1-\pi_1$ are in $(0,1)$.

By spatial independence, the joint distribution of the observations of all sensors

$$Y(t):=(Y_1(t),\ldots,Y_N(t))^\top \tag{2}$$

at any time $t$ is $\nu_1^N$ under $H_1$ and $\nu_0^N$ under $H_0$. Our main results in Section III are derived under Assumption 1. Section V extends them to non-identical (but still independent) sensors’ observations.

### II-B Centralized detection, log-moment generating function (LMGF), and optimal error exponent

The log-likelihood ratio of sensor $i$ at time $t$ is denoted by $L_i(t)$ and given by

$$L_i(t)=\log\frac{f_1(Y_i(t))}{f_0(Y_i(t))},$$

where $f_l(\cdot)$, $l=0,1$, is 1) the probability density function corresponding to $\nu_l$, when $Y_i(t)$ is an absolutely continuous random variable; or 2) the probability mass function corresponding to $\nu_l$, when $Y_i(t)$ is discrete valued.

Under Assumption 1, the log-likelihood ratio test for $k$ time observations from all sensors, for a threshold $\gamma$, is:

$$D(k):=\frac{1}{Nk}\sum_{t=1}^{k}\sum_{i=1}^{N}L_i(t)\ \underset{H_0}{\overset{H_1}{\gtrless}}\ \gamma. \tag{3}$$

Footnote 1: In (3), we re-scale the spatio-temporal sum of the log-likelihood ratios by dividing the sum by $Nk$. We can do so without loss of generality, as the alternative test without re-scaling is $\sum_{t=1}^{k}\sum_{i=1}^{N}L_i(t)\ \underset{H_0}{\overset{H_1}{\gtrless}}\ \gamma'$, with $\gamma'=Nk\gamma$.

Log-moment generating function (LMGF). We introduce the LMGF of $L_i(t)$ and its properties, which play a major role in assessing the performance of distributed detection.

Let $\Lambda_l$ ($l=0,1$) denote the LMGF of the log-likelihood ratio under hypothesis $H_l$:

$$\Lambda_l:\mathbb{R}\longrightarrow(-\infty,+\infty],\qquad \Lambda_l(\lambda)=\log E\left[e^{\lambda L_1(1)}\,\middle|\,H_l\right]. \tag{4}$$

In (4), $L_1(1)$ replaces $L_i(t)$, for arbitrary $i=1,\ldots,N$ and $t=1,2,\ldots$, because the observations are identically distributed in space and time; see Assumption 1.

###### Lemma 1

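Consider Assumption 1. For $\Lambda_0$ and $\Lambda_1$ in (4), the following holds:

(a) $\Lambda_0$ is convex;

(b) $\Lambda_0(\lambda)\in(-\infty,0)$, for $\lambda\in(0,1)$, $\Lambda_0(0)=\Lambda_0(1)=0$, and $\Lambda_l'(0)=E[L_1(1)|H_l]$, $l=0,1$;

(c) $\Lambda_1$ satisfies:

$$\Lambda_1(\lambda)=\Lambda_0(\lambda+1),\quad\text{for }\lambda\in\mathbb{R}. \tag{5}$$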
###### Proof.

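For a proof of (a) and (b), see [28]. Part (c) follows from the definitions of $\Lambda_0$ and $\Lambda_1$, which we show here for the case when the distributions $\nu_1$ and $\nu_0$ are absolutely continuous (the proof for discrete distributions is similar):

$$\begin{aligned}\Lambda_1(\lambda)&=\log E\left[e^{\lambda L_1(1)}\,\middle|\,H_1\right]=\log\int_{y\in\mathbb{R}}\left(\frac{f_1(y)}{f_0(y)}\right)^{\lambda}f_1(y)\,dy\\&=\log\int_{y\in\mathbb{R}}\left(\frac{f_1(y)}{f_0(y)}\right)^{1+\lambda}f_0(y)\,dy=\Lambda_0(1+\lambda).\end{aligned}$$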
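As a quick numerical sanity check of Lemma 1, the following minimal sketch evaluates the LMGF in (4) for a discrete observation model and verifies properties (b) and (c) at a few points. The two probability mass functions `P0` and `P1` are illustrative assumptions (they do not come from the paper), and the helper name `lmgf` is hypothetical.

```python
import math

# Hypothetical pmfs on a 3-point alphabet (illustrative values only).
P0 = [0.5, 0.3, 0.2]   # distribution of Y under H0
P1 = [0.1, 0.3, 0.6]   # distribution of Y under H1

def lmgf(lam, l):
    """Lambda_l(lam) = log E[exp(lam * L) | H_l], with L = log(p1(Y) / p0(Y))."""
    base = P1 if l == 1 else P0
    return math.log(sum(b * (q1 / q0) ** lam for b, q0, q1 in zip(base, P0, P1)))

# Lemma 1(b): Lambda_0 vanishes at 0 and 1 and is negative in between.
print(lmgf(0.0, 0), lmgf(1.0, 0), lmgf(0.5, 0))

# Lemma 1(c): Lambda_1(lam) = Lambda_0(lam + 1) for every real lam.
for lam in [-1.5, -0.5, 0.0, 0.3, 1.0, 2.0]:
    print(lam, lmgf(lam, 1), lmgf(lam + 1.0, 0))
```

The check is only as strong as the handful of test points; it illustrates the identity rather than proving it.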

We further assume that the LMGF of a sensor’s observation is finite.

###### Assumption 2

$\Lambda_0(\lambda)<+\infty$, $\forall\lambda\in\mathbb{R}$.

In the next two remarks, we give two classes of problems when Assumption 2 holds.

Remark I. We consider the signal+noise model:

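$$Y_i(t)=\begin{cases}m+n_i(t), & H_1\\ n_i(t), & H_0.\end{cases} \tag{6}$$

Here $m\neq 0$ is a constant signal and $n_i(t)$ is a zero-mean additive noise with density function $f_n(\cdot)$ supported on $\mathbb{R}$; we rewrite $f_n(\cdot)$, without loss of generality, as $f_n(y)=c\,e^{-g(y)}$, where $c>0$ is a constant. Then, the Appendix shows that Assumption 2 holds under the following mild technical condition: either one of (7) or (8) and one of (9) or (10) hold:

$$\lim_{y\to+\infty}\frac{g(y)}{|y|^{\tau_+}}=\rho_+,\quad\text{for some }\rho_+,\tau_+\in(0,+\infty) \tag{7}$$

$$\lim_{y\to+\infty}\frac{g(y)}{(\log(|y|))^{\mu_+}}=\rho_+,\quad\text{for some }\rho_+\in(0,+\infty),\ \mu_+\in(1,+\infty) \tag{8}$$

$$\lim_{y\to-\infty}\frac{g(y)}{|y|^{\tau_-}}=\rho_-,\quad\text{for some }\rho_-,\tau_-\in(0,+\infty) \tag{9}$$

$$\lim_{y\to-\infty}\frac{g(y)}{(\log(|y|))^{\mu_-}}=\rho_-,\quad\text{for some }\rho_-\in(0,+\infty),\ \mu_-\in(1,+\infty). \tag{10}$$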

In (8) and (10), we can also allow either (or both) to equal 1, but then the corresponding is in . Note that need not be symmetric, i.e., need not be equal to . Intuitively, the tail of the density behaves regularly, and grows either like a polynomial of arbitrary finite order in , or slower, like a power , , or like a logarithm . The class of admissible densities includes, e.g., power laws , , or the exponential families , , with: 1) the Lebesgue base measure ; 2) the polynomial, power, or logarithmic potentials ; and 3) the canonical set of parameters [29].

Remark II. Assumption 2 is satisfied if $Y_i(t)$ has arbitrary (different) distributions under $H_1$ and $H_0$ with the same, compact support; a special case is when $Y_i(t)$ is discrete, supported on a finite alphabet.

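Centralized detection: Asymptotic performance. We consider briefly the performance of the centralized detector, which will benchmark the performance of the distributed detector. Denote by $\gamma_l:=E[L_1(1)|H_l]$, $l=0,1$. It can be shown [30] that $\gamma_0<0$ and $\gamma_1>0$. Now, consider the centralized detector in (3) with constant threshold $\gamma$, for all $k$, and denote by:

$$\alpha(k,\gamma)=P(D(k)\geq\gamma\,|\,H_0),\quad \beta(k,\gamma)=P(D(k)<\gamma\,|\,H_1),\quad P_e(k,\gamma)=\alpha(k,\gamma)\pi_0+\beta(k,\gamma)\pi_1, \tag{11}$$

respectively, the probability of false alarm, probability of miss, and Bayes (average) error probability. In this paper, we adopt the minimum Bayes error probability criterion, both for the centralized and later for our distributed detector, and, from now on, we refer to it simply as the error probability. A standard theorem (Theorem 3.4.3 in [30]) says that, for any choice of $\gamma\in(\gamma_0,\gamma_1)$, the error probability decays exponentially fast to zero in $k$. For $\gamma\notin(\gamma_0,\gamma_1)$, the error probability does not converge to zero at all. To see this, assume that $H_1$ is true, and let $\gamma\geq\gamma_1$. Then, by noting that $E[D(k)|H_1]=\gamma_1$, for all $k$, we have that $\beta(k,\gamma)=P(D(k)<\gamma\,|\,H_1)\geq P(D(k)<\gamma_1\,|\,H_1)\to\frac{1}{2}$ as $k\to\infty$, by the central limit theorem.

Denote by $I_l(\cdot)$, $l=0,1$, the Fenchel-Legendre transform [30] of $\Lambda_l(\cdot)$:

$$I_l(z)=\sup_{\lambda\in\mathbb{R}}\ \lambda z-\Lambda_l(\lambda),\quad z\in\mathbb{R}. \tag{12}$$

It can be shown [30] that $I_l(\cdot)$ is nonnegative, strictly convex, $I_l(\gamma_l)=0$, for $l=0,1$, and $I_1(z)=I_0(z)-z$; the last identity follows from (5) by the change of variable $\lambda\mapsto\lambda+1$ in (12). We now state the result on the centralized detector’s asymptotic performance.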

###### Lemma 2

Let Assumption 1 hold, and consider the family of centralized detectors (3) with constant threshold $\gamma\in(\gamma_0,\gamma_1)$. Then, the best (maximal) error exponent:

$$\lim_{k\to\infty}-\frac{1}{k}\log P_e(k,\gamma)$$

is achieved for the zero threshold $\gamma=0$ and equals $NC_{\mathrm{ind}}$, where $C_{\mathrm{ind}}=I_0(0)$.

The quantity $C_{\mathrm{ind}}$ is referred to as the Chernoff information of a single sensor observation $Y_i(t)$. Lemma 2 says that the centralized detector’s error exponent is $N$ times larger than an individual sensor’s error exponent. We remark that, even if we allow for time-varying thresholds $\gamma_k$, the error exponent cannot be improved, i.e., the centralized detector with zero threshold is asymptotically optimal over all detectors. We will see that, when a certain condition on the network connectivity holds, the distributed detector is asymptotically optimal, i.e., achieves the best error exponent $NC_{\mathrm{ind}}$, and the zero threshold is again optimal. However, when the network connectivity condition is not met, the distributed detector is no longer asymptotically optimal, and the optimal threshold may be non-zero.

###### Proof of Lemma 2.

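Denote by $\Lambda_{0,N}$ the LMGF of the log-likelihood ratio $\sum_{i=1}^{N}L_i(t)$ for the observations of all sensors at time $t$. Then, $\Lambda_{0,N}(\lambda)=N\Lambda_0(\lambda)$, by the i.i.d. in space assumption on the sensors’ observations. The Lemma now follows by the Chernoff lemma (Corollary 3.4.6 in [30]):

$$\lim_{k\to\infty}-\frac{1}{k}\log P_e(k,0)=\max_{\lambda\in[0,1]}\{-\Lambda_{0,N}(\lambda)\}=N\max_{\lambda\in[0,1]}\{-\Lambda_0(\lambda)\}=NI_0(0).$$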
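To make the quantity in Lemma 2 concrete, here is a minimal sketch that approximates the Chernoff information $C_{\mathrm{ind}}=\max_{\lambda\in[0,1]}\{-\Lambda_0(\lambda)\}$ by a grid search, and then the centralized error exponent $NC_{\mathrm{ind}}$. The pmfs `P0`, `P1`, the number of sensors `N`, and the grid resolution are illustrative assumptions, not values from the paper; a grid search is used only for simplicity (any one-dimensional convex minimizer would do).

```python
import math

# Hypothetical discrete observation model (illustrative values only).
P0 = [0.5, 0.3, 0.2]   # pmf under H0
P1 = [0.1, 0.3, 0.6]   # pmf under H1
N = 10                 # hypothetical number of sensors

def lmgf0(lam):
    """Lambda_0(lam) = log sum_y p1(y)^lam * p0(y)^(1 - lam)."""
    return math.log(sum(q1 ** lam * q0 ** (1.0 - lam) for q0, q1 in zip(P0, P1)))

def chernoff_information(steps=20000):
    """Grid approximation of max over lam in [0, 1] of -Lambda_0(lam)."""
    return max(-lmgf0(i / steps) for i in range(steps + 1))

c_ind = chernoff_information()
print("single-sensor Chernoff information:", c_ind)
print("centralized error exponent N * C_ind:", N * c_ind)
```

Because $\Lambda_0$ is convex with $\Lambda_0(0)=\Lambda_0(1)=0$, the maximizer lies in $(0,1)$, so restricting the search to $[0,1]$ loses nothing.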

### II-C Distributed detection algorithm

We now consider distributed detection when the sensors cooperate through a randomly varying network. Specifically, we consider the running consensus distributed detector proposed in [20]. Each node $i$ maintains its local decision variable $x_i(k)$, which is a local estimate of the global optimal decision variable $D(k)$ in (3). Note that $D(k)$ is not locally available. At each time $k$, each sensor $i$ updates $x_i(k)$ in two ways: 1) by incorporating its new observation $Y_i(k)$ to make an intermediate decision variable $\frac{k-1}{k}x_i(k-1)+\frac{1}{k}L_i(k)$; and 2) by exchanging the intermediate decision variable locally with its neighbors and computing the weighted average of its own and the neighbors’ intermediate variables.

More precisely, the update of $x_i(k)$ is as follows:

$$x_i(k)=\sum_{j\in O_i(k)}W_{ij}(k)\left(\frac{k-1}{k}x_j(k-1)+\frac{1}{k}L_j(k)\right),\quad k=1,2,\ldots,\qquad x_i(0)=0. \tag{13}$$

Here $O_i(k)$ is the (random) neighborhood of sensor $i$ at time $k$ (including $i$), and $W_{ij}(k)$ are the (random) averaging weights. Sensor $i$’s local decision test at time $k$ is:

$$x_i(k)\ \underset{H_0}{\overset{H_1}{\gtrless}}\ \gamma, \tag{14}$$

i.e., $H_1$ (respectively, $H_0$) is decided when $x_i(k)\geq\gamma$ (respectively, $x_i(k)<\gamma$).

We now write the consensus+innovations algorithm (13) in vector form. Let $x(k)=(x_1(k),\ldots,x_N(k))^\top$ and $L(k)=(L_1(k),\ldots,L_N(k))^\top$. Also, collect the averaging weights $W_{ij}(k)$ in the $N\times N$ matrix $W(k)=[W_{ij}(k)]$, where, clearly, $W_{ij}(k)=0$ if the sensors $i$ and $j$ do not communicate at time step $k$. The algorithm (13) becomes:

$$x(k)=W(k)\left(\frac{k-1}{k}x(k-1)+\frac{1}{k}L(k)\right),\quad k=1,2,\ldots,\qquad x(0)=0. \tag{15}$$

Network model. We state the assumption on the random averaging matrices .

###### Assumption 3

The averaging matrices $W(k)$ satisfy the following:

(a) The sequence $\{W(k)\}_{k=1}^{\infty}$ is i.i.d.

(b) $W(k)$ is symmetric and stochastic (row-sums equal 1 and $W_{ij}(k)\geq 0$) with probability one, $\forall k$.

(c) There exists $\eta>0$, such that, for any realization $W(k)$, $W_{ii}(k)\geq\eta$, $\forall i$, and, whenever $W_{ij}(k)>0$, $W_{ij}(k)\geq\eta$.

(d) $W(k)$ and $Y(t)$ are mutually independent over all $k$ and $t$.

Condition (c) is mild and says that: 1) sensor $i$ assigns a non-negligible weight to itself; and 2) when sensor $i$ receives a message from sensor $j$, sensor $i$ assigns a non-negligible weight to sensor $j$.

Define the matrices $\Phi(k,t)$ by:

$$\Phi(k,t):=W(k)W(k-1)\cdots W(t),\quad k\geq t\geq 1. \tag{16}$$

It is easy to verify from (15) that $x(k)$ equals:

$$x(k)=\frac{1}{k}\sum_{t=1}^{k}\Phi(k,t)L(t),\quad k=1,2,\ldots \tag{17}$$
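The recursion (15) and its closed form (17) can be checked against each other numerically. The sketch below is a minimal illustration under stated assumptions: the random network is a pairwise gossip model on a complete graph (one of many models satisfying Assumption 3), and the log-likelihood ratios are replaced by Gaussian stand-in samples. The function names `gossip_matrix`, `matvec`, and `run` are hypothetical.

```python
import random

def gossip_matrix(n, rng):
    """Symmetric stochastic W(k): one random pair (i, j) averages its values."""
    w = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    i, j = rng.sample(range(n), 2)
    w[i][i] = w[j][j] = 0.5
    w[i][j] = w[j][i] = 0.5
    return w

def matvec(w, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def run(n=4, k_max=50, seed=0):
    rng = random.Random(seed)
    x = [0.0] * n                      # x(0) = 0
    ws, ls = [], []
    for k in range(1, k_max + 1):
        w = gossip_matrix(n, rng)
        l = [rng.gauss(0.5, 1.0) for _ in range(n)]   # stand-in for L_j(k)
        ws.append(w)
        ls.append(l)
        # Innovation step, then consensus step, as in (15).
        inter = [(k - 1) / k * xj + lj / k for xj, lj in zip(x, l)]
        x = matvec(w, inter)
    # Closed form (17): x(k) = (1/k) * sum_t Phi(k, t) L(t),
    # where Phi(k, t) L(t) applies W(t) first and W(k) last.
    closed = [0.0] * n
    for t in range(1, k_max + 1):
        v = ls[t - 1]
        for s in range(t, k_max + 1):
            v = matvec(ws[s - 1], v)
        closed = [c + vi / k_max for c, vi in zip(closed, v)]
    d = sum(sum(l) for l in ls) / (n * k_max)   # centralized statistic D(k) in (3)
    return x, closed, d

x, closed, d = run()
print("recursion  :", x)
print("closed form:", closed)
print("network average of x(k):", sum(x) / len(x), " D(k):", d)
```

Since every $W(k)$ is symmetric and stochastic, the network-wide average of $x(k)$ equals $D(k)$ at all times; what the network randomness affects is how far each individual $x_i(k)$ deviates from that average.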

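Choice of threshold $\gamma$. We restrict the choice of threshold $\gamma$ to $\gamma\in(\gamma_0,\gamma_1)$, $\gamma_0<0$, $\gamma_1>0$, where we recall $\gamma_l=E[L_1(1)|H_l]$, $l=0,1$. The reason is as follows. $W(t)$ is a stochastic matrix, hence $W(t)1=1$, for all $t$, and thus $\Phi(k,t)1=1$. Also, $E[L(t)|H_l]=\gamma_l 1$, for all $t$, $l=0,1$. Now, by iterating expectation:

$$E[x(k)|H_l]=E\big[E[x(k)\,|\,H_l,W(1),\ldots,W(k)]\big]=E\left[\frac{1}{k}\sum_{t=1}^{k}\Phi(k,t)E[L(t)|H_l]\right]=\gamma_l 1,\quad l=0,1,$$

and $E[x_i(k)|H_l]=\gamma_l$, for all $i$ and $k$. Moreover, it can be shown (proof is omitted due to lack of space) that $x_i(k)$ converges in probability to $\gamma_l$ under $H_l$. Now, a similar argument as with the centralized detector in Subsection II-B shows that, for $\gamma\notin(\gamma_0,\gamma_1)$, the error probability does not converge to zero. We will show that, for any $\gamma\in(\gamma_0,\gamma_1)$, the error probability converges to 0 exponentially fast, and we find the optimal $\gamma=\gamma^\star$ that maximizes a certain lower bound on the exponent of the error probability.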

Network connectivity. From (17), we can see that the matrices $\Phi(k,t)$ should be as close to $J$ as possible for enhanced detection performance. Namely, the ideal (unrealistic) case when $\Phi(k,t)\equiv J$, for all $k$ and $t$, corresponds to the scenario where each sensor is equivalent to the optimal centralized detector. It is well known that, under certain conditions, the matrices $\Phi(k,t)$ converge in probability to $J$:

$$P\left(\|\Phi(k,t)-J\|>\epsilon\right)\to 0\quad\text{as }(k-t)\to\infty,\quad\forall\epsilon>0,$$

such that $P(\|\Phi(k,t)-J\|>\epsilon)$ vanishes exponentially fast in $(k-t)$, i.e., $P(\|\Phi(k,t)-J\|>\epsilon)\sim r^{k-t}$, $r\in[0,1]$. The quantity $r$ determines the speed of convergence of the matrices $\Phi(k,t)$. The closer $r$ is to zero, the faster consensus is. We refer to $|\log r|$ as the network connectivity. We will see that the distributed detection performance depends significantly on $|\log r|$. Formally, $|\log r|$ is given by:

$$|\log r|:=\lim_{(k-t)\to\infty}-\frac{1}{k-t}\log P\left(\|\Phi(k,t)-J\|>\epsilon\right). \tag{18}$$

Footnote 2: It can be shown that the limit in (18) exists and that it does not depend on $\epsilon$.

For the exact calculation of $r$, we refer to [31]. Reference [31] shows that, for the commonly used models of $W(k)$, gossip and link failure (links in the underlying network fail independently, with possibly mutually different probabilities), $r$ is easily computable, by solving a certain min-cut problem. In general, $r$ is not easily computable, but all our results (Theorem 5, Corollary 6, Corollary 11) hold when $r$ is replaced by an upper bound. An upper bound on $r$ is given in [31].

The following Lemma easily follows from (18).

###### Lemma 4

Let Assumption 3 hold. Then, for any $\delta>0$, there exists a constant $C(\delta)$ (independent of $\epsilon$) such that:

$$P\left(\|\Phi(k,t)-J\|>\epsilon\right)\leq C(\delta)\,e^{-(k-t)(|\log r|-\delta)},\quad\text{for all }k\geq t.$$

## III Main results: Asymptotic analysis and error exponents for distributed detection

Subsection III-A states our main results on the asymptotic performance of consensus+innovations distributed detection; subsection III-B proves these results.

### III-A Statement of main results

In this section, we analyze the performance of distributed detection in terms of the detection error exponent, when the number of observations (per sensor), or the size $k$ of the observation interval, tends to $+\infty$. We show that there exists a threshold on the network connectivity $|\log r|$ such that, if $|\log r|$ is above this threshold, each node in the network achieves asymptotic optimality (i.e., the error exponent at each node is the total Chernoff information, equal to $NC_{\mathrm{ind}}$). When $|\log r|$ is below the threshold, we give a lower bound for the error exponent. Both the threshold and the lower bound are given solely in terms of the log-moment generating function $\Lambda_0$ and the number of sensors $N$. These findings are summarized in Theorem 5 and Corollary 6 below.

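Let $\alpha_i(k,\gamma)$, $\beta_i(k,\gamma)$, and $P_{e,i}(k,\gamma)$ denote the probability of false alarm, the probability of miss, and the error probability, respectively, of sensor $i$ for the detector (13) and (14), for the threshold equal to $\gamma$:

$$\alpha_i(k,\gamma)=P(x_i(k)\geq\gamma\,|\,H_0),\quad \beta_i(k,\gamma)=P(x_i(k)<\gamma\,|\,H_1),\quad P_{e,i}(k,\gamma)=\pi_0\alpha_i(k,\gamma)+\pi_1\beta_i(k,\gamma), \tag{19}$$

where, we recall, $\pi_1$ and $\pi_0$ are the prior probabilities.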

###### Theorem 5

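Let Assumptions 1-3 hold and consider the family of distributed detectors in (13) and (14) with $\gamma\in(\gamma_0,\gamma_1)$. Let $\lambda_l^s$ be the zero of the function:

$$\Delta_l(\lambda):=\Lambda_l(N\lambda)-|\log r|-N\Lambda_l(\lambda),\quad l=0,1, \tag{20}$$

and define $\gamma_l^-$, $\gamma_l^+$, $l=0,1$, by

$$\gamma_0^-=\Lambda_0'(\lambda_0^s),\qquad \gamma_0^+=\Lambda_0'(N\lambda_0^s)\geq\gamma_0^- \tag{21}$$

$$\gamma_1^-=\Lambda_1'(N\lambda_1^s),\qquad \gamma_1^+=\Lambda_1'(\lambda_1^s)\geq\gamma_1^-. \tag{22}$$

Then, for every $\gamma\in(\gamma_0,\gamma_1)$, at each sensor $i$, $i=1,\ldots,N$, we have:

$$\liminf_{k\to\infty}-\frac{1}{k}\log\alpha_i(k,\gamma)\geq B_0(\gamma),\qquad \liminf_{k\to\infty}-\frac{1}{k}\log\beta_i(k,\gamma)\geq B_1(\gamma), \tag{23}$$

where

$$B_0(\gamma)=\max_{\lambda\in[0,1]}\ N\gamma\lambda-\max\{N\Lambda_0(\lambda),\ \Lambda_0(N\lambda)-|\log r|\}=\begin{cases}NI_0(\gamma), & \gamma\in(\gamma_0,\gamma_0^-]\\ NI_0(\gamma_0^-)+N\lambda_0^s(\gamma-\gamma_0^-), & \gamma\in(\gamma_0^-,\gamma_0^+)\\ I_0(\gamma)+|\log r|, & \gamma\in[\gamma_0^+,\gamma_1)\end{cases}$$

$$B_1(\gamma)=\max_{\lambda\in[-1,0]}\ N\gamma\lambda-\max\{N\Lambda_1(\lambda),\ \Lambda_1(N\lambda)-|\log r|\}=\begin{cases}I_1(\gamma)+|\log r|, & \gamma\in(\gamma_0,\gamma_1^-]\\ NI_1(\gamma_1^+)+N\lambda_1^s(\gamma-\gamma_1^+), & \gamma\in(\gamma_1^-,\gamma_1^+)\\ NI_1(\gamma), & \gamma\in[\gamma_1^+,\gamma_1).\end{cases}$$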
###### Corollary 6

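Let Assumptions 1-3 hold and consider the family of distributed detectors in (13) and (14) parameterized by detector thresholds $\gamma\in(\gamma_0,\gamma_1)$. Then:

1. $$\liminf_{k\to\infty}-\frac{1}{k}\log P_{e,i}(k,\gamma)\geq\min\{B_0(\gamma),B_1(\gamma)\}>0, \tag{24}$$

   and the lower bound in (24) is maximized for the point $\gamma^\star\in(\gamma_0,\gamma_1)$ at which $B_0(\gamma^\star)=B_1(\gamma^\star)$. (Footnote 3: As we show in the proof, such a point exists and is unique.)

2. Consider $\lambda^\bullet=\arg\min_{\lambda\in\mathbb{R}}\Lambda_0(\lambda)$, and let:

   $$\mathrm{thr}(\Lambda_0,N)=\max\{\Lambda_0(N\lambda^\bullet)-N\Lambda_0(\lambda^\bullet),\ \Lambda_0(1-N(1-\lambda^\bullet))-N\Lambda_0(\lambda^\bullet)\}. \tag{25}$$

   Then, when $|\log r|\geq\mathrm{thr}(\Lambda_0,N)$, each sensor $i$ with the detector threshold set to $\gamma=0$ is asymptotically optimal:

   $$\lim_{k\to\infty}-\frac{1}{k}\log P_{e,i}(k,0)=NC_{\mathrm{ind}}.$$

3. When $\Lambda_0(\lambda)=\Lambda_0(1-\lambda)$, for $\lambda\in\mathbb{R}$, $\gamma^\star=0$, irrespective of the value of $r$ (even when $|\log r|<\mathrm{thr}(\Lambda_0,N)$).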

Figure 1 (left) illustrates the error exponent lower bounds and in Theorem 5, while Figure 1 (right) illustrates the quantities in (21). ( See the definition of the function in (36) in the proof of Theorem 5.) We consider sensors and a discrete distribution of over a 5-point alphabet, with the distribution under , and under . We set here

Corollary 6 states that, when the network connectivity $|\log r|$ is above a threshold, the distributed detector in (13) and (14) is asymptotically equivalent to the optimal centralized detector. The corresponding optimal detector threshold is $\gamma^\star=0$. When $|\log r|$ is below the threshold, Corollary 6 determines what value of the error exponent the distributed detector can achieve, for any given $\gamma\in(\gamma_0,\gamma_1)$. Moreover, Corollary 6 finds the optimal detector threshold $\gamma^\star$ for a given $r$; $\gamma^\star$ can be found as the unique zero of the strictly decreasing function $B_1(\gamma)-B_0(\gamma)$ on $(\gamma_0,\gamma_1)$ (see the proof of Corollary 6), e.g., by bisection on $(\gamma_0,\gamma_1)$.
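The bounds in Theorem 5 and the threshold search in Corollary 6 can be approximated numerically. The following is a minimal sketch, assuming a hypothetical 3-point discrete observation model, $N=3$ sensors, and an arbitrary connectivity value $|\log r|=0.2$; none of these values come from the paper. It evaluates $B_0$ and $B_1$ by a grid search over $\lambda$ (directly from their max-form definitions), finds the point where they cross by bisection, and evaluates (25) with $\lambda^\bullet$ taken as the grid minimizer of $\Lambda_0$.

```python
import math

# Illustrative assumptions (not from the paper).
P0 = [0.5, 0.3, 0.2]    # pmf under H0
P1 = [0.1, 0.3, 0.6]    # pmf under H1
N = 3                   # number of sensors
LOG_R = 0.2             # hypothetical network connectivity |log r|
GRID = [i / 1000 for i in range(1001)]   # grid on [0, 1]

def lam0(lam):
    """Lambda_0(lam) = log sum_y p1(y)^lam * p0(y)^(1 - lam)."""
    return math.log(sum(q1 ** lam * q0 ** (1.0 - lam) for q0, q1 in zip(P0, P1)))

def lam1(lam):
    """Lambda_1(lam) = Lambda_0(lam + 1), by (5)."""
    return lam0(lam + 1.0)

def b0(gamma, log_r=LOG_R):
    """Grid approximation of B_0(gamma): maximization over lambda in [0, 1]."""
    return max(N * gamma * l - max(N * lam0(l), lam0(N * l) - log_r) for l in GRID)

def b1(gamma, log_r=LOG_R):
    """Grid approximation of B_1(gamma); lambda = -l ranges over [-1, 0]."""
    return max(-N * gamma * l - max(N * lam1(-l), lam1(-N * l) - log_r) for l in GRID)

# gamma_l = E[L | H_l]; admissible thresholds lie in (gamma_0, gamma_1).
GAMMA0 = sum(q0 * math.log(q1 / q0) for q0, q1 in zip(P0, P1))
GAMMA1 = sum(q1 * math.log(q1 / q0) for q0, q1 in zip(P0, P1))

def optimal_threshold(log_r=LOG_R, iters=50):
    """Bisection for the zero of the decreasing function B_1 - B_0."""
    lo, hi = GAMMA0, GAMMA1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if b1(mid, log_r) > b0(mid, log_r):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Connectivity threshold (25), with lambda_bullet approximated on the grid.
LAM_BULLET = min(GRID, key=lam0)
C_IND = -lam0(LAM_BULLET)
THR = max(lam0(N * LAM_BULLET), lam0(1.0 - N * (1.0 - LAM_BULLET))) + N * C_IND

gamma_star = optimal_threshold()
print("gamma* =", gamma_star)
print("exponent lower bound =", min(b0(gamma_star), b1(gamma_star)))
print("thr =", THR, " N * C_ind =", N * C_IND)
```

With these illustrative numbers the chosen $|\log r|$ falls below the computed threshold, so the sketch exercises the suboptimal regime; calling `optimal_threshold(THR + 0.1)` should instead return a value close to zero, in line with part 2 of Corollary 6.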

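Remark. When $\Lambda_0(\lambda)=\Lambda_0(1-\lambda)$, for $\lambda\in\mathbb{R}$, it can be shown that $\gamma_0=-\gamma_1$, and $B_0(\gamma)=B_1(-\gamma)$, for all $\gamma\in(\gamma_0,\gamma_1)$. This implies that the point $\gamma^\star$ at which $B_0$ and $B_1$ are equal is necessarily zero, and hence the optimal detector threshold is $\gamma^\star=0$, irrespective of the network connectivity (even when $|\log r|<\mathrm{thr}(\Lambda_0,N)$). This symmetry holds, e.g., for the Gaussian and Laplace distributions considered in Section IV.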

Corollary 6 establishes that there exists a “sufficient” level of connectivity beyond which further improvement in the connectivity (and further spending of resources, e.g., transmission power) does not pay off in terms of detection performance. Hence, Corollary 6 is valuable in the practical design of a sensor network, as it says how much connectivity (resources) is sufficient to achieve asymptotically optimal detection.

Equation (24) says that the distribution of the sensor observations (through LMGF) plays a role in determining the performance of distributed detection. We illustrate and explain by examples this effect in Section IV.

### III-B Proofs of the main results

We first prove Theorem 5.

###### Proof of Theorem 5.

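Consider the probability of false alarm $\alpha_i(k,\gamma)$ in (19). We upper bound $\alpha_i(k,\gamma)$ using the exponential Markov inequality [32] parameterized by $\zeta\geq 0$:

$$\alpha_i(k,\gamma)=P(x_i(k)\geq\gamma\,|\,H_0)=P\left(e^{\zeta x_i(k)}\geq e^{\zeta\gamma}\,\middle|\,H_0\right)\leq E\left[e^{\zeta x_i(k)}\,\middle|\,H_0\right]e^{-\zeta\gamma}. \tag{26}$$

Next, by setting $\zeta=Nk\lambda$, with $\lambda\geq 0$, and using (17), we obtain:

$$\begin{aligned}\alpha_i(k,\gamma)&\leq E\left[e^{Nk\lambda x_i(k)}\,\middle|\,H_0\right]e^{-Nk\lambda\gamma} &&(27)\\&=E\left[e^{N\lambda\sum_{t=1}^{k}\sum_{j=1}^{N}\Phi_{i,j}(k,t)L_j(t)}\,\middle|\,H_0\right]e^{-Nk\lambda\gamma}. &&(28)\end{aligned}$$

The terms in the sum in the exponent in (28) are conditionally independent, given the realizations of the averaging matrices $W(t)$, $t=1,\ldots,k$. Thus, by iterating the expectations, and using the definition of $\Lambda_0$ in (4), we compute the expectation in (28) by conditioning first on $W(t)$, $t=1,\ldots,k$:

$$\begin{aligned}E\left[e^{N\lambda\sum_{t=1}^{k}\sum_{j=1}^{N}\Phi_{i,j}(k,t)L_j(t)}\,\middle|\,H_0\right]&=E\left[E\left[e^{N\lambda\sum_{t=1}^{k}\sum_{j=1}^{N}\Phi_{i,j}(k,t)L_j(t)}\,\middle|\,H_0,W(1),\ldots,W(k)\right]\right]\\&=E\left[e^{\sum_{t=1}^{k}\sum_{j=1}^{N}\Lambda_0(N\lambda\Phi_{i,j}(k,t))}\right]. &&(29)\end{aligned}$$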

Partition of the sample space. We handle the random matrix realizations $W(t)$, $t=1,\ldots,k$, through a suitable partition of the underlying probability space. Adapting an argument from [1], we partition the probability space based on the time of the last successful averaging. In more detail, for a fixed $k$, introduce the partition of the sample space that consists of the disjoint events $A_{s,k}$, $s=0,1,\ldots,k$, given by:

$$A_{s,k}=\{\|\Phi(k,s)-J\|\leq\epsilon\ \text{and}\ \|\Phi(k,s+1)-J\|>\epsilon\},$$

for $s=1,\ldots,k-1$, $A_{0,k}=\{\|\Phi(k,1)-J\|>\epsilon\}$, and $A_{k,k}=\{\|\Phi(k,k)-J\|\leq\epsilon\}$. For simplicity of notation, we drop the index $k$ in the sequel and denote event $A_{s,k}$ by $A_s$, $s=0,\ldots,k$. Intuitively, the smaller $t$ is, the closer the product $\Phi(k,t)$ is to $J$; if the event $A_s$ occurred, then the largest $t$ for which the product $\Phi(k,t)$ is still $\epsilon$-close to $J$ equals $s$. We now show that $\{A_s\}_{s=0}^{k}$ is indeed a partition. We need the following simple Lemma. The Lemma shows that the convergence of $\Phi(k,s)-J$ is monotonic, for any realization of the matrices $W(1),W(2),\ldots,W(k)$.

###### Lemma 7

Let Assumption 3 hold. Then, for any realization of the matrices $W(1),\ldots,W(k)$:

$$\|\Phi(k,s)-J\|\leq\|\Phi(k,t)-J\|,\quad\text{for }1\leq s\leq t\leq k.$$
###### Proof.

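Since every realization of $W(t)$ is stochastic and symmetric for every $t$, we have that $W(t)1=1$ and $1^\top W(t)=1^\top$, and so: $\Phi(k,s)-J=W(k)\cdots W(s)-J=(W(k)-J)\cdots(W(s)-J)$. Now, using the sub-multiplicative property of the spectral norm, we get

$$\begin{aligned}\|\Phi(k,s)-J\|&=\|(W(k)-J)\cdots(W(t)-J)(W(t-1)-J)\cdots(W(s)-J)\|\\&\leq\|(W(k)-J)\cdots(W(t)-J)\|\,\|W(t-1)-J\|\cdots\|W(s)-J\|.\end{aligned}$$

To prove Lemma 7, it remains to show that $\|W(t)-J\|\leq 1$, for any realization of $W(t)$. To this end, fix a realization $W$ of $W(t)$. Consider the eigenvalue decomposition $W=QMQ^\top$, where $M=\mathrm{diag}(\mu_1,\ldots,\mu_N)$ is the matrix of eigenvalues of $W$, and the columns of $Q$ are the orthonormal eigenvectors. As $\frac{1}{\sqrt{N}}1$ is the eigenvector associated with eigenvalue $\mu_1=1$, we have that $W-J=QM'Q^\top$, where $M'=\mathrm{diag}(0,\mu_2,\ldots,\mu_N)$. Because $W$ is stochastic, we know that $1=\mu_1\geq\mu_2\geq\ldots\geq\mu_N\geq-1$, and so $\|W-J\|=\max\{|\mu_2|,|\mu_N|\}\leq 1$.

To show that $\{A_s\}_{s=0}^{k}$ is a partition, note first that (at least) one of the events $A_0,\ldots,A_k$ necessarily occurs. It remains to show that the events $A_s$ are disjoint. We do this by fixing an arbitrary $s\in\{0,\ldots,k\}$ and showing that, if the event $A_s$ occurs, then $A_t$, $t\neq s$, does not occur. Suppose that $A_s$ occurs, i.e., the realizations $W(1),\ldots,W(k)$ are such that $\|\Phi(k,s)-J\|\leq\epsilon$ and $\|\Phi(k,s+1)-J\|>\epsilon$. Fix any $t>s$. Then, event $A_t$ does not occur, because, by Lemma 7, $\|\Phi(k,t)-J\|\geq\|\Phi(k,s+1)-J\|>\epsilon$. Now, fix any $t<s$. Then, event $A_t$ does not occur, because, by Lemma 7, $\|\Phi(k,t+1)-J\|\leq\|\Phi(k,s)-J\|\leq\epsilon$. Thus, for any $s=0,\ldots,k$, if the event $A_s$ occurs, then $A_t$, for $t\neq s$, does not occur, and hence the events $A_s$ are disjoint.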
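The partition argument amounts to locating the time of the last successful averaging. As a small illustration, the sketch below takes the sequence of distances $\|\Phi(k,t)-J\|$, $t=1,\ldots,k$ (assumed given and nondecreasing in $t$, as Lemma 7 guarantees), and returns the index $s$ of the unique event $A_s$ that occurred. The function name is hypothetical, and the boundary conventions ($s=0$ when no product is $\epsilon$-close to $J$, $s=k$ when all are) are an assumption about how the end cases of the partition are defined.

```python
def last_successful_averaging(dists, eps):
    """Return s such that the event A_s occurred.

    dists[t - 1] holds ||Phi(k, t) - J|| for t = 1..k and is assumed to be
    nondecreasing in t (Lemma 7). The result is the largest t with
    dists[t - 1] <= eps, or 0 if no such t exists.
    """
    s = 0
    for t, d in enumerate(dists, start=1):
        if d <= eps:
            s = t
    return s

# Example with k = 4: products starting at t = 1 and t = 2 are eps-close to J.
print(last_successful_averaging([0.01, 0.05, 0.30, 0.80], eps=0.1))
```

Because the distances are monotone, exactly one index satisfies the defining condition of $A_s$, which is the disjointness property established above.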

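Using the total probability law over the partition $\{A_s\}_{s=0}^{k}$, the expectation (29) is computed by:

$$E\left[e^{\sum_{t=1}^{k}\sum_{j=1}^{N}\Lambda_0(N\lambda\Phi_{i,j}(k,t))}\right]=\sum_{s=0}^{k}E\left[e^{\sum_{t=1}^{k}\sum_{j=1}^{N}\Lambda_0(N\lambda\Phi_{i,j}(k,t))}\,I_{A_s}\right], \tag{30}$$

where, we recall, $I_{A_s}$ is the indicator function of the event $A_s$. The following lemma explains how to use the partition $\{A_s\}_{s=0}^{k}$ to upper bound the expectation in (30).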

###### Lemma 8

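Let Assumptions 1-3 hold. Then:

1. For any realization of the random matrices $W(t)$, $t=1,2,\ldots,k$:

   $$\sum_{j=1}^{N}\Lambda_0(N\lambda\Phi_{i,j}(k,t))\leq\Lambda_0(N\lambda),\quad\forall t=1,\ldots,k.$$

2. Further, consider a fixed $s$ in $\{0,1,\ldots,k\}$. If the event $A_s$ occurred, then, for $i=1,\ldots,N$: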