# Sensor Networks: From Dependence Analysis via Matroid Bases to Online Synthesis

An early version of this work appeared in the International Symposium on Algorithms for Sensor Systems, Wireless Ad Hoc Networks and Autonomous Mobile Entities (ALGOSENSORS 2011), Saarbruecken, Germany. This work was partially supported by MAFAT, the Chief Scientist (Magnet - Captain) and ISF.

###### Abstract

Consider the two related problems of sensor selection and sensor fusion. In the first, given a set of sensors, one wishes to identify a subset of the sensors, which while small in size, captures the essence of the data gathered by the sensors. In the second, one wishes to construct a fused sensor, which utilizes the data from the sensors (possibly after discarding dependent ones) in order to create a single sensor which is more reliable than each of the individual ones.

In this work, we rigorously define the dependence among sensors in terms of joint empirical measures and incremental parsing. We show that these measures adhere to a polymatroid structure, which in turn facilitates the application of efficient algorithms for sensor selection. We suggest both a random and a greedy algorithm for sensor selection. Given an independent set, we then turn to the fusion problem, and suggest a novel variant of the exponential weighting algorithm. In the suggested algorithm, one competes against an augmented set of sensors, which allows it to converge to the best fused sensor in a family of sensors, without having any prior data on the sensors’ performance.

###### keywords:

Sensor networks, Dependence analysis, Polymatroids, Matroid optimization, Randomized algorithms, Greedy selection, Empirical measures, Lempel-Ziv, Incremental parsing, Online fusion

Journal: Theoretical Computer Science

URL: www.bgu.ac.il/~coasaf

URL: www.cs.bgu.ac.il/~dolev

## 1 Introduction

Sensor networks are used to gather and analyze data in a variety of applications. In this model, numerous sensors are either spread over a wide area, or simply measure different aspects of a certain phenomenon. The goal of a central processor which gathers the data is, in general, to draw inferences about the environment the sensors measure and make various decisions. An example to keep in mind is a set of sensors monitoring various networking aspects in an organization (incoming and outgoing traffic, addresses, remote procedure calls, HTTP requests to servers and such). In many such cases, an anomalous behavior detected by a single sensor may not be reliable enough to announce that the system is under attack. Moreover, different sensors might have correlated data, as they measure related phenomena. Hence, the central processor faces two problems: first, how to identify the set of sensors which sense independent data, and discard the rest, which only clutter the decision process; second, how to intelligently combine the data from the selected sensors in order to decide whether to raise an alarm.

In this work, we target both problems. First, we consider the problem of sensor selection. Clearly, as data aggregated by different sensors may be highly dependent, due to, for example, co-location or other similarities in the environment, it is desirable to identify the largest set of independent (or nearly independent) sensors. This way, sensor fusion algorithms can be much more efficient. For example, in the fusion algorithm we present, identifying the set of independent sensors allows us to create families of fused sensors based on fewer sensors, hence having a significantly smaller parameter space.
Moreover, identifying independent sensors also benefits various control methods, where a few representative independent inputs facilitate easier analysis. Note that the sensor selection problem is different from the data compression problem, where the dependence among the data sets is reduced via some kind of Slepian-Wolf coding Slepian and Wolf (1973). Herein, we do not wish all data to be reconstructed at the center; we focus only on identifying good sets of independent sensors, such that *their* data can be analyzed, disregarding other sensors. In other words, we do not wish to replace Slepian-Wolf coding by sending data of independent sensors, only to identify the independent subsets. For example, the randomized algorithm we suggest gathers data only from small subsets of the sensors, yet is assured to identify independent sets with high probability. In a similar manner, the greedy algorithm we suggest can identify a subset of sensors with relatively high independence among them (compared to other subsets), even in cases where we do not wish to identify a subset containing *all* the information.

Given two data sets, a favored method to measure their dependence is through various *mutual information estimates*. Such estimates arise from calculating marginal and joint empirical entropies, or from the more efficient method of incremental (Lempel-Ziv) parsing Ziv and Lempel (1978). Indeed, LZ parsing was used, for example, for multidimensional data analysis Zozor et al. (2005), neural computation Blanc et al. (2008) and numerous other applications. However, despite the ability of the parsing rule to approximate the true entropy of the source, and hence, as one possible consequence, to identify dependencies in the data, applications reported in the current literature were ad hoc, using the resulting measures to compare mainly between *pairs of sources*.

To date, there is no rigorous method to analyze independence among large sets, and to handle cases where one sensor’s data may depend on measurements from many others, including *various delays*. In this work, we give a mathematical framework which enables us both to rigorously define the problem of identifying sets of independent sources in a large set of sensors and to give highly efficient approximation algorithms based on the observations we gain.

Still, when no single sensor is reliable enough to give an accurate estimate of the phenomenon it measures, sensor fusion is used Hall and Llinas (1997); Sasiadek (2002); Waltz (1986). In the second part of this work, we consider the problem of sensor fusion. In this case, for a given set of sensors, one wishes to generate a *new sensor*, whose performance over time is (under some measure) better than any single sensor in the set. Clearly, in most cases, choosing the best-performing sensor in the set might not be enough. We wish, in general, to create a sensor whose performance is strictly better than any given sensor in the original set, by utilizing data from several sensors simultaneously and intelligently combining it.

### 1.1 Contribution.

Our main contributions are the following. First, we show how to harness the wide variety of algorithms for identifying largest independent sets in matroids, or the closely related problems of minimum cycle bases and spanning trees in graphs, to our problem of identifying sets of independent sensors in a sensor network. Our approach is based on highly efficient (linear time in the size of the data) methods to estimate the dependence among the sensors, such as the Lempel-Ziv parsing rule Ziv and Lempel (1978). The key step, however, is in showing how these estimates can either yield a polymatroid, or at least approximate one, thus facilitating the use of polynomial time algorithms to identify the independent sets, such as Karger (1993); Berger et al. (2004). We construct both random and greedy selection algorithms, and analyze their performance.

We then turn to the problem of (non-correlated) sensor fusion. In particular, we describe an online fusion algorithm based on exponential weighting Vovk (1990).
While weighted majority algorithms were used in the context of sensor networks Jeon and Landgrebe (1999); Yu and Sycara (2006); Polikar et al. (2006), in these works the exponential weighting was used only to identify good sensors and order them by performance. Hence, applied directly, this algorithm does not yield a good *fused* sensor. In this part of our work, we suggest a novel extension by creating parametric families of synthesised sensors. This way, we are able to *span a huge set of fused sensors*, and choose online the best fused sensor. That is, given a set of sensors $S$, this algorithm *constructs synthesised sensors*, from which it selects a sensor whose performance converges to that of the best sensor in both $S$ and the constructed parametric family of synthesised sensors. In other words, this algorithm results in online sensor fusion.

We rigorously quantify the regret of the suggested algorithm compared to the best fused sensor. In this way, a designer of a sensor fusion algorithm has a well-quantified trade-off: choosing a large number of parameters, thus covering more families of fusion possibilities, at the price of higher regret.

## 2 Preliminaries

The basic setting we consider is the following. A set of $m$ sensors, $S = \{s_1, \ldots, s_m\}$, is measuring a set of values in a certain environment. Each sensor may depend on a different set of values, and may base its decision on these values in a different way. However, each sensor, at each time instance, estimates whether a target exists in the environment or not. Thus, the input to sensor $i$ at time $t$ is some vector of measurements, based on which it outputs a value $\hat{b}_t^i \in [0,1]$, which is its estimate for the probability a target exists at time $t$. (The model of target identification and the requirement for outputs in $[0,1]$ is mainly for illustration purposes. The algorithms we describe are suitable for different sets as well, with adequate quantization or scaling.) Throughout, capital letters denote random variables while lower case denotes realizations. Hence, $\hat{B}_t^i$ denotes the possibly random output of sensor $i$ at time $t$.

Let $b_1, b_2, \ldots$ be the binary sequence indicating whether a target actually appeared at time $t$ or not. The normalized cumulative loss of sensor $i$ over $T$ time instances is defined as

$$L_T^i = \frac{1}{T}\sum_{t=1}^{T} d\big(b_t, \hat{b}_t^i\big), \qquad (1)$$

for some distance function $d(\cdot,\cdot)$. If the sensor’s output is binary (a sensor either decides a target exists or not), then $\hat{b}_t^i \in \{0,1\}$ and a reasonable distance measure is the Hamming distance, that is, $d(b, \hat{b}) = |b - \hat{b}|$.

If the sensor’s output is in $[0,1]$, then we think of it as the sensor’s estimate for the probability a target exists, and a reasonable $d$ is the *log-loss*, $d(b, \hat{b}) = -b \log \hat{b} - (1-b) \log(1-\hat{b})$.

In any case, the goal of a good sensor is to minimize the normalized cumulative loss $L_T^i$, as given by (1). Roughly speaking, in the first part of this paper, we wish to use the estimates $\hat{b}_t^i$ to identify dependencies among the sensors, and in the second part we wish to construct fused sensors, whose cumulative loss is lower than that of any given sensor.
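To make the loss definitions concrete, the following is a minimal Python sketch (the function names are ours, not from the paper) of the Hamming distance, the log-loss, and the normalized cumulative loss of equation (1):

```python
import math

def hamming(b, bh):
    # Hamming distance for binary outputs.
    return 0.0 if b == bh else 1.0

def log_loss(b, bh, eps=1e-12):
    # Log-loss for outputs in [0,1]; eps guards against log(0).
    bh = min(max(bh, eps), 1.0 - eps)
    return -(b * math.log(bh) + (1 - b) * math.log(1 - bh))

def normalized_cumulative_loss(targets, outputs, d):
    # Equation (1): (1/T) * sum over t of d(b_t, bhat_t).
    T = len(targets)
    return sum(d(b, bh) for b, bh in zip(targets, outputs)) / T
```

A perfect sensor achieves zero loss under either distance; a sensor that errs on half the instances incurs Hamming loss 1/2.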

### 2.1 Polymatroids, Matroids and Entropic Vectors.

Let $M = \{1, \ldots, m\}$ be an index set of size $m$ and $2^M$ be the power set of $M$. A function $f: 2^M \to \mathbb{R}$ defines a *polymatroid* $(M, f)$ with a ground set $M$ and rank function $f$ if it satisfies the following conditions Oxley (1992):

$$f(\emptyset) = 0, \qquad (2)$$

$$f(A) \le f(B) \quad \text{for all } A \subseteq B \subseteq M, \qquad (3)$$

$$f(A) + f(B) \ge f(A \cup B) + f(A \cap B) \quad \text{for all } A, B \subseteq M. \qquad (4)$$

For a polymatroid with ground set $M$, we represent $f$ by the vector $\mathbf{f}$ defined on the ordered, non-empty subsets of $M$. We denote
the set of all polymatroids with a ground set of size $m$ by $\Gamma_m$. Thus $\mathbf{f} \in \Gamma_m$ if and only if $f(\emptyset) = 0$ and the entries of $\mathbf{f}$ satisfy equations (2)–(4) for all $A, B \subseteq M$, where $f(A)$ is the value of $\mathbf{f}$ at the entry corresponding to the subset $A$. If, in addition to (2)–(4), $f$ is integer-valued and $f(A) \le |A|$ for all $A \subseteq M$, then $(M, f)$ is called a *matroid*.

Now, assume $X_1, \ldots, X_m$ is some set of discrete random variables. For any $A \subseteq M$, let $H(X_A)$ denote the joint entropy of $\{X_i\}_{i \in A}$. An entropy vector $\mathbf{h}$ is a $(2^m - 1)$-dimensional vector whose entries are the joint entropies of all non-empty subsets of $\{X_1, \ldots, X_m\}$. It is well-known that the entropy function is a polymatroid over this ground set $M$. Indeed, (2)–(4) are equivalent to the Shannon information inequalities Yeung (2002). However, there exist points $\mathbf{h} \in \Gamma_m$ for which there is no set of discrete random variables whose joint entropies equal $\mathbf{h}$. Following Chan and Grant (2008), we denote by $\Gamma_m^*$ the set of all $\mathbf{h}$ for which there exists at least one random vector whose joint entropies equal $\mathbf{h}$. Such an $\mathbf{h}$ is called *entropic*.
Finally, denote by $\bar{\Gamma}_m^*$ the convex closure of $\Gamma_m^*$. Then $\bar{\Gamma}_m^* = \Gamma_m$ for $m \le 3$ but $\bar{\Gamma}_m^* \subsetneq \Gamma_m$ for $m \ge 4$ Yeung (2002).
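Conditions (2)–(4) are easy to check numerically for a candidate rank function. The sketch below is our own illustration (`is_polymatroid` is a hypothetical helper, not from the paper); it verifies normalization, monotonicity and submodularity over all pairs of subsets:

```python
from itertools import combinations

def subsets(ground):
    # All subsets of the ground set, as frozensets.
    s = list(ground)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def is_polymatroid(f, ground, tol=1e-9):
    # f: dict mapping frozenset -> rank value.
    subs = subsets(ground)
    if abs(f[frozenset()]) > tol:                        # (2) f(empty) = 0
        return False
    for A in subs:
        for B in subs:
            if A <= B and f[A] > f[B] + tol:             # (3) monotonicity
                return False
            if f[A] + f[B] + tol < f[A | B] + f[A & B]:  # (4) submodularity
                return False
    return True
```

For example, the entropy vector of two independent fair bits (1, 1, 2) passes the check, while raising the joint entry to 3 violates submodularity.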

## 3 A Matroid-Based Framework for Identifying Independent Sensors

In this section, we use the incremental parsing rule of Lempel and Ziv Ziv and Lempel (1978) to estimate the *joint empirical entropies* of the sensors’ data. We then show that when the sensors’ data is stationary and ergodic, the vector of joint empirical entropies can be approximated by some point in the polyhedral cone $\Gamma_m$. In fact, this point is actually in $\Gamma_m^*$. As asymptotically entropic polymatroids are well approximated by asymptotically entropic matroids (Matúš, 2007, Theorem 5), the point in $\Gamma_m^*$ which corresponds to the joint empirical entropies of the sensors is approximated by the *ranks of some matroid*. This enables us to identify independent sets of sensors, and, in particular, largest independent sets, by identifying the bases (or circuits) of the matroid.
Doing this, the most complex dependence structures among sensors, including both dependence between past/future data and dependence among values at the same time instant, can be identified. Non-linear dependencies are also captured.

We now show how to approximate an entropy vector (hence, a polymatroid) for the sensor data. We prove that indeed for large enough data and ergodic sources the approximation error is arbitrarily small. This polymatroid will be the input from which we will identify the independent sensors.

We first consider the simplest case, in which one treats the sensors as having memoryless data. That is, sensors for which each reading (in time) is independent of the previous or future readings. Note, however, that this model still allows the reading of a sensor to depend on the readings of other sensors *at that time instant*. The dependence might be a simple (maybe linear) dependence between two sensors, or a more complex one, where one sensor’s output is a random function of the outputs of a few others. It is important to note that it is inconsequential whether the sensors are indeed memoryless or not. Using this simplified method, only dependencies across a single time instant will be identified. A generalization for time-dependent data appears in the next subsection.

For the sake of simplicity, assume now all outputs are binary. Given a sequence $x^n = (x_1, \ldots, x_n)$, denote by $n_0(x^n)$ and $n_1(x^n)$ the number of zeros and ones in $x^n$, respectively. That is,

$$n_1(x^n) = \sum_{t=1}^{n} \mathbb{1}\{x_t = 1\}, \qquad n_0(x^n) = n - n_1(x^n),$$

where $\mathbb{1}\{\cdot\}$ is the indicator function. When the sequence indices are clear from the context, we will abbreviate this by $n_0$ and $n_1$. Hence,

$$\hat{P}_{x^n} = \left( \frac{n_0}{n}, \frac{n_1}{n} \right)$$

denotes the *type* of the sequence $x^n$, that is, its empirical frequencies Cover and Thomas (2006).

In a similar manner, we define the empirical frequencies of several sequences together, e.g. pairs. For example,

$$n_{ab}(x^n, y^n) = \sum_{t=1}^{n} \mathbb{1}\{x_t = a, y_t = b\}, \qquad a, b \in \{0, 1\}.$$

In this case, the $4$-tuple

$$\hat{P}_{x^n, y^n} = \left( \frac{n_{00}}{n}, \frac{n_{01}}{n}, \frac{n_{10}}{n}, \frac{n_{11}}{n} \right)$$

denotes the *joint type* of $(x^n, y^n)$, hence, it includes the empirical frequencies of the two sequences *together*, over their product alphabet $\{0,1\}^2$. For more than two sequences, we denote by $\hat{P}_{x^n_1, \ldots, x^n_k}$ the joint type of the sequences $x^n_1, \ldots, x^n_k$.

For a probability vector $p = (p_1, \ldots, p_l)$, let $H(p)$ denote its entropy, that is,

$$H(p) = -\sum_{i=1}^{l} p_i \log p_i.$$

Let $\hat{\mathbf{h}}$ be the $(2^m - 1)$-dimensional vector whose entries are all the joint *empirical entropies* calculated from the sensors’ data. I.e., for each non-empty $A \subseteq M$,

$$\hat{h}_A = H\big(\hat{P}_{\{x^n_i\}_{i \in A}}\big).$$

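Computing the vector of joint empirical entropies is straightforward: count joint types and evaluate their entropies for every non-empty subset of sensors. A minimal sketch (our naming; the enumeration is exponential in the number of sensors, as is the vector itself):

```python
import math
from collections import Counter
from itertools import combinations

def empirical_entropy(rows):
    # Joint empirical entropy (in bits) of a list of symbol tuples.
    n = len(rows)
    counts = Counter(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def empirical_entropy_vector(data):
    # data: list of m sequences of equal length n.
    # Returns {subset A -> joint empirical entropy of the sequences in A}.
    m = len(data)
    vec = {}
    for r in range(1, m + 1):
        for A in combinations(range(m), r):
            rows = tuple(zip(*(data[i] for i in A)))  # joint type support
            vec[A] = empirical_entropy(rows)
    return vec
```

Two identical sequences contribute no more joint entropy than one of them alone, while an independent third sequence adds a full bit; this is exactly the structure exploited later when searching for independent subsets.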
Under these definitions, we have the following.

###### Proposition 1.

For every realization of the sensors’ data, $\hat{\mathbf{h}} \in \Gamma_m^*$.

###### Proof.

We wish to show that the vector of joint empirical entropies, $\hat{\mathbf{h}}$, is entropic for any finite $n$. Hence, $\hat{\mathbf{h}} \in \Gamma_m^*$. The important observation is that empirical measures (as defined herein) are legitimate probability measures (even if the approximation error compared to the true measure is large), hence entropies calculated from them give rise to an entropic polymatroid.

Since for any subset $A$, $\hat{P}_{\{x^n_i\}_{i \in A}}$ clearly defines a valid distribution (all entries are in $[0,1]$ and they sum up to $1$), consistency is the only property that remains to show. Assume all sensors’ outputs belong to some finite alphabet $\mathcal{X}$. For any $i \in A$ and symbols $a_j \in \mathcal{X}$, we have

$$\sum_{a_i \in \mathcal{X}} \hat{P}_{\{x^n_j\}_{j \in A}}\big(\{a_j\}_{j \in A}\big) = \hat{P}_{\{x^n_j\}_{j \in A \setminus \{i\}}}\big(\{a_j\}_{j \in A \setminus \{i\}}\big),$$

that is, the marginals of the joint types agree with the lower-order types, which completes the proof. ∎

Let $\mathbf{h}$ denote the true (memoryless) entropy vector of the sources. That is, for each non-empty $A \subseteq M$,

$$h_A = H\big(\{X_i\}_{i \in A}\big),$$

where $X_i$ is the single-letter marginal of the data of sensor $i$.

For stationary and ergodic sources, the following Proposition is a direct application of Birkhoff’s ergodic theorem.

###### Proposition 2.

Let $\{x^n_i\}_{i=1}^{m}$ be drawn from a stationary and ergodic source with some probability measure $P$. Then, for any subset $A$, we have $\hat{h}_A \to h_A$ $P$-a.s. (almost surely). As a result, $\hat{\mathbf{h}} \to \mathbf{h}$ a.s.

That is, the entropy calculated from the empirical distribution converges to the true entropy. Moreover, the vector of empirical entropies converges almost surely (a.s.) to the true entropy vector, which is, of course, an entropic polymatroid. To be able to harness the diverse algorithmic literature on matroids (such as matroid optimization, relevant for our independence analysis application), we mention that by (Matúš, 2007, Theorem 5), describing the cone of asymptotically entropic polymatroids is reduced to the problem of describing asymptotically entropic *matroids*.

### 3.1 Dependence Measures for Sensors with Memory.

Until now, we considered sensors for which the data of any *individual* sensor is a stationary and ergodic process, yet, through first-order empirical entropies, only the dependence within a single time instant was estimated. While being very easy to implement (linear in the size of the data), this method fails to capture complex dependence structures. For example, consider a sensor whose current data depends heavily on *previous data* acquired by *one or several other* sensors.

To capture dependence in time, we offer the incremental parsing rule Ziv and Lempel (1978) as a basis for an empirical measure. We show that indeed such a measure will converge almost surely to a polymatroid, from which maximal independent sets can be approximated. We start with a few definitions.

Let $x^n$ be some sequence over a finite alphabet $\mathcal{X}$ of size $\alpha$. The ZL78 Ziv and Lempel (1978) parsing rule is a sequential procedure which parses the sequence in a way where a new phrase is created as soon as the still unparsed part of the string differs from all preceding phrases. For example, the binary string $01011011101100$ is parsed as

$$0, 1, 01, 10, 11, 101, 100.$$

Let $c(x^n)$ denote the number of distinct phrases whose concatenation generates $x^n$. Furthermore, let $\rho_s(x^n)$ denote the compression ratio achieved by the best finite-state encoder with at most $s$ states, and define

$$\rho(x) = \lim_{s \to \infty} \limsup_{n \to \infty} \rho_s(x^n).$$

In a nutshell, the main results of Ziv and Lempel (1978) state that, on the one hand, any finite-state encoder with at most $s$ states satisfies

$$\rho_s(x^n) \ge \frac{c(x^n) \log c(x^n)}{n} - \delta_s(n),$$

where $\delta_s(n)$ vanishes as $n$ grows and $\alpha$ is the alphabet size. On the other hand, for any sequence $x$, there exists a finite state encoder (based on the incremental parsing itself) with a compression ratio satisfying

$$\rho_{LZ}(x^n) \le \frac{c(x^n) \log c(x^n)}{n} + \epsilon(n), \qquad \epsilon(n) \to 0.$$

Thus,

$$\frac{c(x^n) \log c(x^n)}{n}$$

is an asymptotically attainable lower bound on the compression ratio $\rho(x)$. Denote by $\bar{H}(X)$ the *entropy rate* of a stationary source $X$, that is,

$$\bar{H}(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n).$$

For sources $X^{(1)}, \ldots, X^{(m)}$, the entropy rate vector $\bar{\mathbf{h}}$ is defined as the $(2^m - 1)$-dimensional vector with entries $\bar{h}_A = \bar{H}\big(\{X^{(i)}\}_{i \in A}\big)$ for non-empty $A \subseteq M$.

Analogously to the memoryless case, herein we also define the joint parsing rule in the trivial way, that is, parsing any subset of sequences as a single sequence over the product alphabet. Define the LZ-based estimated entropy vector $\hat{\mathbf{h}}^{LZ}$ as (suppressing the dependence on $n$)

$$\hat{h}^{LZ}_A = \frac{c_A \log c_A}{n},$$

where $c_A$ is the number of phrases in the joint parsing of the sequences $\{x^n_i\}_{i \in A}$.
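The incremental parsing and the resulting entropy estimate can be sketched in a few lines of Python (our naming; the estimate returned is the $c \log c / n$ quantity discussed above, with logarithms base 2, and joint parsing over the product alphabet is obtained by tupling the sequences):

```python
import math

def lz78_phrases(seq):
    # Incremental (LZ78) parsing: a new phrase ends as soon as the
    # still-unparsed suffix differs from all preceding phrases.
    seen, phrases, cur = set(), [], ()
    for sym in seq:
        cur = cur + (sym,)
        if cur not in seen:
            seen.add(cur)
            phrases.append(cur)
            cur = ()
    if cur:                      # trailing repeat of an existing phrase
        phrases.append(cur)
    return phrases

def lz_entropy_estimate(seqs):
    # Joint LZ estimate for a subset of sequences: parse them as a
    # single sequence over the product alphabet and return c*log2(c)/n.
    joint = list(zip(*seqs))
    n = len(joint)
    c = len(lz78_phrases(joint))
    return c * math.log2(c) / n
```

Parsing a sequence jointly with a copy of itself yields essentially the same phrase count as parsing it alone, which is exactly why the estimate exposes dependence between sensors.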
The following is the analogue of Proposition 2 for the non-memoryless case.

###### Proposition 3.

Let $\{x^n_i\}_{i=1}^{m}$ be drawn from a stationary and ergodic source $P$. Then, $\bar{\mathbf{h}} \in \bar{\Gamma}_m^*$ and we have $\hat{\mathbf{h}}^{LZ} \to \bar{\mathbf{h}}$ $P$-a.s.

###### Proof.

We wish to see that $\hat{\mathbf{h}}^{LZ}$ converges to $\bar{\mathbf{h}}$, and that indeed $\bar{\mathbf{h}} \in \bar{\Gamma}_m^*$. The convergence of each entry follows since, for a stationary and ergodic source, the LZ compression ratio converges almost surely to the entropy rate Ziv and Lempel (1978). It is not hard to show that $\bar{\mathbf{h}} \in \bar{\Gamma}_m^*$. To see this, remember that $\frac{1}{n} H(\cdot)$, ranging over all subsets, forms (after normalization) an entropic polymatroid Yeung (2002). Hence the limit $\bar{\mathbf{h}}$ forms an asymptotically entropic polymatroid (as the closure of the entropic region is convex), hence $\bar{\mathbf{h}} \in \bar{\Gamma}_m^*$.

Note, however, that the analogue of Proposition 1 is not true in this case. That is, for finite $n$, $\hat{\mathbf{h}}^{LZ}$ might not satisfy the polymatroid axioms at all. Nevertheless, by Proposition 3, for large enough $n$, $\hat{\mathbf{h}}^{LZ}$ is sufficiently close to $\bar{\mathbf{h}}$. A fortiori, it is sufficiently close to $\bar{\Gamma}_m^*$. Moreover, for ergodic sources with finite memory, namely, sources for which

$$P(x_t \mid x_{t-1}, x_{t-2}, \ldots) = P(x_t \mid x_{t-1}, \ldots, x_{t-k})$$

for some finite $k$, there exist a few strong tail bounds on the probability that the LZ compression ratio exceeds a certain threshold. For example, if $h_{\max}$ denotes the maximal entry in $\bar{\mathbf{h}}$, we have the following proposition.

###### Proposition 4.

Let $\{x^n_i\}_{i=1}^{m}$ be drawn from a stationary and ergodic Markov source $P$. Then, with probability at least $1 - \epsilon(n)$, every entry of $\hat{\mathbf{h}}^{LZ}$ is within $\delta(n)$ of the corresponding entry of $\bar{\mathbf{h}}$, where $\epsilon(n)$ and $\delta(n)$ vanish as $n$ grows, at rates given by (Savari, 1997, Corollary 2).

###### Proof.

By (Savari, 1997, Corollary 2), each entry of $\hat{\mathbf{h}}^{LZ}$ concentrates around the corresponding entropy rate. Remembering that this holds for any subset $A$, and using the union bound on all $2^m - 1$ entries of $\hat{\mathbf{h}}^{LZ}$, results in the stated bound, which completes the proof. ∎

The usefulness of Proposition 4 is twofold. First, it gives a practical bound on the approximation the vector $\hat{\mathbf{h}}^{LZ}$ gives to $\bar{\mathbf{h}}$. Second, assume $\bar{\mathbf{h}}$ is a matroid. This is the case, for example, when bits in the sensors’ data are either independent or completely dependent (in fact, in this case $\bar{\mathbf{h}}$ is a *linearly representable* binary matroid). Since $\hat{\mathbf{h}}^{LZ}$ might not satisfy the polymatroid axioms at all, using Proposition 4 one can then easily check when the entries of $\hat{\mathbf{h}}^{LZ}$ can be rounded to the nearest integer in order to achieve $\bar{\mathbf{h}}$ exactly.

###### Remark 1.

We mention that a different approach to target sensors with memory is to calculate *high order* empirical entropies, that is, entropies calculated from the frequency count of the data seen by a sliding window of a fixed length $k$. With this approach, the achieved vector is entropic (hence a polymatroid) for any finite $n$. Moreover, with a good tail bound such as Lezaud (1998) for irreducible Markov chains over a finite alphabet, we are able to show fast convergence to the true values. The complexity, however, grows exponentially with $k$. Thus, approaching entropy rates in order to capture long-time dependencies is of exponential complexity. In the LZ method we suggest, while the alphabet size indeed grows exponentially with the number of jointly parsed sensors, the complexity is only a function of the alphabet size, and remains linear in the length of the data.
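For comparison, the sliding-window alternative mentioned in this remark can be sketched as follows (our naming; an order-$k$ empirical entropy estimate whose frequency table grows exponentially with the window length):

```python
import math
from collections import Counter

def kth_order_empirical_entropy(seq, k):
    # Empirical entropy (bits per symbol) of overlapping windows of
    # length k, normalized by k -- an estimate of the order-k entropy rate.
    windows = [tuple(seq[t:t + k]) for t in range(len(seq) - k + 1)]
    n = len(windows)
    counts = Counter(windows)
    H = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return H / k
```

A constant sequence yields zero, while a strictly alternating sequence yields about half a bit per symbol for `k = 2`, since each window determines the next symbol.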

## 4 Identifying Independent Sets of Sensors.

When the number of sensors is small, and the complexity of calculating all entries of $\hat{\mathbf{h}}^{LZ}$ is reasonable, one can find a subset with high enough entropy (strong independence) by simply taking the smallest set of sensors whose estimated joint entropy is high enough. However, when the number of sensors is larger (even a few dozen), this method is prohibitively complex, and more sophisticated algorithms (and their analysis) are required.

Thus, having set the ground, in this section we utilize optimization algorithms for submodular functions, and matroids in particular, in order to find maximal independent sets of sensors efficiently. Herein, we include two examples: a random selection algorithm, which fits cases where true data forms a matroid, for which possibly many subsets of sensors include the desired data, and a greedy algorithm, which easily fits any dependence structure (while matroids asymptotically span the entropic cone, an additional approximation step is required Matúš (2007)). It is important to note that, unlike the greedy selection (also used in Shamaiah et al. (2010) in the context of maximum a posteriori estimates) which approximates the optimum value *up to a constant factor*, the random selection process we suggest here can guarantee exact approximation.

The randomized algorithm is given in Algorithm RandomSelection. As simple as it looks, by Proposition 4 and (Karger, 1993, Theorem 5.2), under mild assumptions on the true distribution of the data, it guarantees that with high probability such a random selection produces a subset of sensors which is only a $p$-fraction of the original, yet if the original contains enough bases (maximal independent sets), then the subset contains a base as well. This is summarized in the following corollary.

###### Corollary 1.

Let $\{x^n_i\}_{i=1}^{m}$ be drawn from a stationary and ergodic Markov source. Assume that $\bar{\mathbf{h}}$ is a matroid of rank $r$ which contains $k$ disjoint bases. Then, with high probability (quantified via (Karger, 1993, Theorem 5.2)), the subset produced by Algorithm RandomSelection contains a maximal independent set of sensors.
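Algorithm RandomSelection itself is not reproduced in this excerpt; the following Python sketch is one plausible reading of the description (the parameter names `p` and `trials` are ours): sample each sensor independently with probability `p` and accept the first sampled subset whose estimated entropy matches that of the full set.

```python
import random

def random_selection(entropy, m, p, trials, tol=1e-9, rng=None):
    # entropy(A): estimated joint entropy of a subset A (e.g. LZ-based).
    # Repeatedly keep each of the m sensors with probability p; return the
    # first sampled subset spanning the entropy of all m sensors, else None.
    rng = rng or random.Random(0)
    full = entropy(frozenset(range(m)))
    for _ in range(trials):
        A = frozenset(i for i in range(m) if rng.random() < p)
        if entropy(A) >= full - tol:
            return A
    return None
```

On a toy instance with two fully dependent sensors and one independent one, a moderate sampling probability quickly finds a spanning subset.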

At first sight, Algorithm RandomSelection does not depend on any of the dependence measures discussed in this paper. Yet, its power *is drawn from them*: once we have established the estimated entropy vector as the key variable in determining dependence, we know that this asymptotic matroid is the one we should analyze for independent sets; *according to its features we should choose the parameters* in RandomSelection, and these features will indeed eventually determine the success probability of RandomSelection.

On the other hand, algorithm GreedySelection takes a different course of action, to answer a slightly different question: how does one choose a small set of sensors with relatively high entropy (hence, independence)? How bad can one subset of sensors be compared to another of the same size? What is a good method to choose the better one? The algorithm sequentially increases the size of the sensor set until its entropy estimate does not grow. In a similar manner, one can use first-order empirical entropies instead of the LZ estimates. Due to the polymatroid properties we proved in the previous section, a bound on the performance compared to the optimum can be given.

The LZ parsing rule on an alphabet of size $\alpha$ can be implemented in time linear in $n$ (using an adequate tree and a binary enumeration of the alphabet). Hence, the complexity of GreedySelection is polynomial in $m$ and linear in $n$. Nemhauser and Wolsey (1978) analyzed the performance of greedy schemes for submodular functions. As noted in Shamaiah et al. (2010), also for such algorithms, they achieve a factor of $1 - 1/e$ of the optimum.
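A minimal sketch of the greedy selection just described (our naming; `entropy` stands for any estimated joint entropy, e.g. the LZ-based one, and `delta` is a stopping threshold):

```python
def greedy_selection(entropy, m, delta=0.0):
    # Greedily add the sensor with the largest marginal gain in the
    # estimated joint entropy; stop when the gain drops to delta or less.
    selected = set()
    current = 0.0
    while len(selected) < m:
        gains = {i: entropy(frozenset(selected | {i})) - current
                 for i in range(m) if i not in selected}
        best = max(gains, key=gains.get)
        if gains[best] <= delta:
            break
        selected.add(best)
        current += gains[best]
    return selected
```

On the toy instance with two fully dependent sensors and one independent one, the greedy rule picks one sensor from the dependent pair and the independent sensor, then stops.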

In practice, it might be beneficial to stop the algorithm if the entropy estimate does not grow above a certain threshold, to avoid steps which may include only a marginal improvement. In fact, this is exactly where the polymatroid properties we proved earlier kick in, and we have the following.

###### Proposition 5.

Assume Algorithm GreedySelection is stopped after the first time the entropy estimate was incremented by less than some $\Delta > 0$. Then, for stationary and ergodic sources, the difference between the entropy of the currently selected subset of sensors and the entropy that could have been reached if the algorithm concluded is upper bounded by $m\Delta$ (up to the estimation error of the LZ parsing rule).

###### Proof.

We wish to prove that if at some stage of the algorithm the improvement was some $\Delta' < \Delta$, then no further step can improve by more than $\Delta'$, and hence the total improvement (till completion) is bounded by about $m\Delta$. To show this, we use the polymatroid axioms, and the fact that the LZ parsing rule estimates the entropy up to an additive estimation error which vanishes as $n$ increases.

Let $\hat{H}(A_j)$ be the estimated entropy at step $j$ of the algorithm, and $\hat{H}(A_{j+1})$ be the estimated entropy at step $j+1$. We know that

$$\hat{H}(A_{j+1}) - \hat{H}(A_j) < \Delta.$$

Also, for stationary and ergodic sources, with high probability,

$$\big|\hat{H}(A) - \bar{H}(A)\big| \le \epsilon_n$$

for any subset $A$. Assume that at step $j+2$ the algorithm added a sensor $s'$ to the set $A_{j+1}$, such that

$$\hat{H}\big(A_{j+1} \cup \{s'\}\big) - \hat{H}(A_{j+1}) > \hat{H}(A_{j+1}) - \hat{H}(A_j).$$

In this case, by submodularity (equation (4)), up to the estimation error we have

$$\hat{H}\big(A_j \cup \{s'\}\big) - \hat{H}(A_j) \ge \hat{H}\big(A_{j+1} \cup \{s'\}\big) - \hat{H}(A_{j+1}) > \hat{H}(A_{j+1}) - \hat{H}(A_j).$$

Hence, selecting $s'$ instead of the chosen sensor at step $j+1$ would have been a better choice, which contradicts the greedy nature of the algorithm: we assumed the selected sensor maximizes the gain over all possible candidates. As a result, no further step can improve by more than $\Delta$, and since there are at most $m$ steps left, the proposition follows. ∎

## 5 A Sensor Fusion Algorithm Via Exponential Weighting

In this section, we present an online algorithm for sensor fusion. In Vovk (1990), Vovk considered a general set of experts and introduced the *exponential weighting*
algorithm. In this algorithm, each expert is assigned a weight
according to its past performance. By decreasing the weight of poorly
performing experts, hence preferring the ones proved to perform well thus
far, one is able to compete with the best expert, having neither any
*a priori* knowledge on the input sequence nor which expert will
perform the
best. This result was further extended in Littlestone and Warmuth (1994),
where various aspects of a “weighted majority” algorithm were discussed. In Cesa-Bianchi et al. (1997); Haussler et al. (1998); Cesa-Bianchi and Lugosi (1999), lower bounds on the
redundancy of any universal algorithm were given, including for very general loss functions. It is important to
note that the exponential weighting algorithm assumes nothing on the
set of experts, neither their distribution in the space of all
possible experts nor their structure. Consequently, all the results
are of the “worst case” type. Additional results regarding a randomized algorithm for expert selection can be found in Gyorfi et al. (1999) and Vovk (1998).

The exponential weighting algorithm was found useful also in the lossy
source coding works of Linder and Lugosi Linder and Lugosi (2001), Weissman
and Merhav Weissman and Merhav (2002), Gyorgy *et al.* Gyorgy et al. (2004) and the derivation of sequential strategies
for loss functions with memory Merhav et al. (2002). A common
method in these works is the alternation of experts only once every block
of input symbols, necessary to bear the price of this change (e.g.,
transmitting the description of the chosen quantizer
Linder and Lugosi (2001)-Gyorgy et al. (2004)). A major drawback of all the above
algorithms is the need to compute the performance of each expert at every
time instant. In Gyorgy et al. (2004), though, Gyorgy *et al.* exploit the
structure of the experts (as they are all quantizers) to introduce an
algorithm which efficiently computes the performance (or an approximation of it) of each
expert at each stage.

In this work, we offer to use a sequential strategy similar to the one used for loss functions with memory Merhav et al. (2002) and scanning of multidimensional data Cohen et al. (2007, 2008) in order to weight the sensors and identify the best fused sensor. However, given a set of sensors $S$, our goal is to construct a *new sensor*, $\hat{s}$, whose output depends on the outputs of the given sensors, yet its performance is better than the best sensor in the set $S$.
We call $\hat{s}$ a *synthesised (fused) sensor*. Clearly, when the true target appearance sequence is known in advance, suggesting such a sensor is trivial. However, we are interested in an *online* algorithm, which receives the sensors’ outputs at each time instant $t$, together with their *performance in the past* (calculated by having access to $b_{t'}$ for $t' < t$, or estimating it), and computes a synthesised output. We expect the sequence of synthesised outputs given by the algorithm at times $1, \ldots, T$ to have a lower cumulative loss than the best sensor in $S$, for *any possible sequence $b_1, \ldots, b_T$ and any set of sensors $S$*.

Towards this goal, we will define a parametric set of synthesised sensors. Once such a set $F$ is constructed, say $F = \{\hat{s}_\theta : \theta \in \Theta\}$ for some set of parameters $\Theta$ (that is, $|\Theta|$ possible new sensors), we will use the online algorithm to compete with the *best sensor in $S \cup F$*. Clearly, a good choice for $F$ is such that on the one hand it is not too large, yet on the other hand it includes “enough” good synthesised sensors, so the best sensor in $F$ will indeed perform well.

###### Example 1.

A simple example to be kept in mind is a case where the set of sensors, $S$, has the property that all sensors under-estimate the probability that a target exists (for example, since each sensor measures a different aspect of the target, which might not be visible each time the target appears). In this case, a sensor whose output at time $t$ is the maximum of the individual outputs, $\max_i \hat{b}_t^i$, will have a much smaller cumulative loss compared to any individual sensor. As a result, when designing families of synthesised sensors for such a set of sensors, one can think of a synthesised family $F$ which includes, for example, all sensors of the type $\max_{i \in A} \hat{b}_t^i$ for some subset $A \subseteq S$. If the miss-detection probabilities of the sensors are not all equal, clearly some synthesised sensors will perform better than the others.

This example can be easily extended to a case where sensors either under-estimate or over-estimate. Following a single sensor will give a non-negligible error, while a simple median filter (sensor-wise) on a sufficiently large set of sensors might give asymptotically zero error.
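A quick simulation illustrates the example (our toy setup, not from the paper: a target present at every step, each sensor missing it independently with its own probability, and the fused sensor taking the maximum of the outputs):

```python
import random

def simulate(T=2000, miss=(0.4, 0.5, 0.6), seed=1):
    # Each sensor independently misses a present target with its own
    # probability (so every sensor under-estimates); the synthesised
    # sensor outputs the maximum of the individual outputs.
    rng = random.Random(seed)
    errs = [0] * len(miss)
    fused_err = 0
    for _ in range(T):
        b = 1  # a target is present at every step in this toy run
        outs = [0 if rng.random() < p else b for p in miss]
        for i, o in enumerate(outs):
            errs[i] += o != b
        fused_err += max(outs) != b
    return [e / T for e in errs], fused_err / T
```

The fused miss rate is roughly the product of the individual miss probabilities, far below the best single sensor.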

### 5.1 Exponential Weighting for a Parametric Family of Sensors.

Recall that for any time instant $t$, $L_t^i$ denotes the intermediate normalized cumulative loss of sensor $i$. Hence, $t L_t^i$ is simply the unnormalized cumulative loss until (and including) time instant $t$, which we denote for simplicity by $\tilde{L}_t^i$. Furthermore, note that for each $t$, these losses are available to the algorithm. At each time instant $t$, the exponential weighting algorithm assigns each sensor $i$ a probability

$$p_t(i) = \frac{e^{-\eta \tilde{L}_{t-1}^i}}{\sum_{j} e^{-\eta \tilde{L}_{t-1}^j}},$$

for some $\eta > 0$. That is, it assumes the cumulative losses of all sensors up to time $t-1$ are known. Then, at each time instant $t$, after computing $p_t(\cdot)$, the algorithm selects a sensor in $S \cup F$ according to that distribution. The selected sensor is used to compute the *algorithm output at time $t$*, namely, the algorithm uses the selected sensor as the synthesised sensor at time $t$. Note that this indeed results in a synthesised sensor, as even if it turns out that the best sensor at some time instant is in $S$, it is not necessarily always the same sensor, hence the algorithm output will probably not equal any fixed sensor for all time instances $t$. The suggested algorithm is summarized in Algorithm OnlineFusion below.
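A minimal sketch of the exponential weighting step just described (our naming; `eta` uses the standard tuning $\sqrt{8 \ln N / T}/B$, which is an assumption on our part, and a sensor is drawn from the exponential weights at every step):

```python
import math
import random

def online_fusion(sensor_outputs, targets, loss, B=1.0, seed=0):
    # Exponentially weighted selection among N (synthesised) sensors:
    # at each step a sensor is drawn with probability proportional to
    # exp(-eta * cumulative loss) and its output is used as the fused output.
    rng = random.Random(seed)
    N = len(sensor_outputs)
    T = len(targets)
    eta = math.sqrt(8.0 * math.log(N) / T) / B
    cum = [0.0] * N          # unnormalized cumulative losses
    total = 0.0              # algorithm's own cumulative loss
    for t in range(T):
        w = [math.exp(-eta * c) for c in cum]
        r, i = rng.random() * sum(w), 0
        while r > w[i]:      # sample index i with probability w[i]/sum(w)
            r -= w[i]
            i += 1
        total += loss(targets[t], sensor_outputs[i][t])
        for j in range(N):
            cum[j] += loss(targets[t], sensor_outputs[j][t])
    return total / T, [c / T for c in cum]
```

With one always-correct and one always-wrong sensor, the algorithm's average loss quickly approaches that of the better sensor.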

The main advantage of this algorithm is that, under mild conditions, the normalized cumulative loss of the synthesised sensor it produces approaches that of the best sensor in $\hat{F}$; hence it converges to the best synthesised sensor in a family of sensors, without knowing in advance which sensor that might be. By the standard analysis of exponential weighting, the following proposition holds.

###### Proposition 6.

For any sequence of observations, any set $F$ of sensors and any set of synthesised sensors $\hat{F}$ of size $M$, the expected performance of Algorithm OnlineFusion satisfies
$$E\bar{L}_{alg} \le \min_{s \in \hat{F}} \bar{L}_T(s) + B\sqrt{\frac{\ln M}{2T}},$$
where the expectation is over the randomized decisions in the algorithm and $B$ is some upper bound on the instantaneous loss.

For completeness, a proof is given in Appendix A. As a result, as long as $\ln M = o(T)$, the synthesised sensor has a vanishing redundancy compared to the best sensor in $\hat{F}$. This gives us enormous freedom in choosing the parametrized set of sensors $\hat{F}$; even sets whose size grows polynomially with the size of the data are acceptable.

The performance of the exponential weighting algorithm can be summarized as follows. For any set of stationary sources with probability measure $P$, as long as the number of synthesised sensors does not grow exponentially with the data, we have

$$\lim_{T \to \infty} E_P\left[E\bar{L}_{alg} - \min_{s \in \hat{F}} \bar{L}_T(s)\right] = 0,$$

where the inner expectation on the left hand side is due to the possible randomization in the algorithm. When the algorithm bases its decisions on independent drawings, we have

$$\lim_{T \to \infty} \left(\bar{L}_{alg} - \min_{s \in \hat{F}} \bar{L}_T(s)\right) = 0$$

almost surely (in terms of the randomization in the algorithm). If, furthermore, the sources are strongly mixing, the above limit holds almost surely in terms of the sources' distribution as well, Cohen et al. (2007).
A by-product of the algorithm is the set of weights it maintains while running. These weights are, in fact, good estimates of the *sensors' reputation*. Moreover, such weights can help us make intelligent decisions for synthesised control and fine-tuning of the sensor selection process; namely, we can clearly see which families of synthesised sensors perform better, and, within a family, which set of parameters should be described at higher granularity compared to the others (since sensors with these values perform well).
Finally, note that this is a *finite horizon* algorithm, since the optimal $\eta$ depends on the size of the data, $T$. One can easily lose the dependence on the size of the data by working with exponentially growing blocks of data.
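The exponentially-growing-blocks idea can be sketched as follows; the function name and the restart schedule (blocks of length $2^k$ with the horizon-optimal learning rate per block) are illustrative assumptions:

```python
import math

def doubling_trick_etas(T_total, M, B):
    """Sketch of working in exponentially growing blocks: restart the
    exponential-weighting algorithm on blocks of length 2^k, each time
    using the horizon-optimal rate eta = sqrt(8 ln M / (B^2 * block_len)).
    Returns a list of (block_length, eta) pairs covering T_total steps."""
    schedule = []
    t, k = 0, 0
    while t < T_total:
        block_len = 2 ** k
        eta = math.sqrt(8 * math.log(M) / (B ** 2 * block_len))
        schedule.append((block_len, eta))
        t += block_len
        k += 1
    return schedule
```

Since each block is run with a rate tuned to its own length, the overall regret remains of the same order as the fixed-horizon bound, up to a constant factor.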

## 6 Results on Real and Artificial Data

To validate the proposed methods in practice, simulations were carried out on both real and synthetic data. We present here some of the results.

To demonstrate Algorithm OnlineFusion, we used real sensor data collected from 54 sensors deployed in the Intel Berkeley Research Lab between February 28th and April 5th, 2004.^{4}^{4}4For details, see http://db.csail.mit.edu/labdata/labdata.html. To avoid overly complex computations, we used only the first real sensors (corresponding to a wing in the lab) and artificially created fused (synthesised) sensors from them. For this basic example, the fused sensors were created by simply averaging the data of any two real sensors. Yet, the results clearly show how the best fused sensor outperforms the best real sensor, with very fast convergence times. Figure 1 demonstrates the convergence of the weight vectors created by the algorithm. At the start (left column), all weights are equal. Very quickly, the two best sensors attain a relatively high weight, while the weights of the others decrease exponentially. Hence, the algorithm identifies the two best sensors very fast. The two best sensors are indeed synthesised ones, with the real sensors performing much worse. Note that there was no real ground-truth data for this sample. The true data was artificially created from *all sensors* with a more complex function than a simple average (first, artifacts were removed, then an average was taken). Thus, an average over merely two sensors, provided they are the best two, outperforms any single one, and handles the artifacts in the data automatically.
Figure 2 depicts the data of two random real sensors (to avoid cluttering the graph), the artificially created true data and the best synthesised sensor.
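The construction of the synthesised family used in this experiment, pairwise averages of real sensors, can be sketched as follows (the function name and array layout are our assumptions):

```python
import numpy as np
from itertools import combinations

def pairwise_average_family(readings):
    """Augment real sensors with all pairwise-average synthesised sensors.
    readings: (T, n) array of real sensor outputs.
    Returns a (T, n + n*(n-1)/2) array: real sensors first, then one
    column per unordered pair (i, j), holding (s_i + s_j) / 2."""
    fused = [(readings[:, i] + readings[:, j]) / 2
             for i, j in combinations(range(readings.shape[1]), 2)]
    return np.column_stack([readings] + fused)
```

The augmented array can then be fed directly to the fusion algorithm, which competes against real and synthesised sensors alike.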

| Method | Sensor Numbers | Entropy Estimate |
|---|---|---|
| Max. triplet | 1, 2, 8 | 4.0732 |
| Random | 15, 7, 2 | 2.9340 |
| Random | 2, 4, 13 | 3.3720 |
| Random | 10, 11, 6 | 3.4966 |
| Random | 2, 10, 8 | 3.5630 |
| Random | 5, 7, 1 | 3.7798 |
| Random | 7, 9, 15 | 3.8290 |
| Random | 1, 9, 14 | 3.8511 |
| Random | 2, 9, 4 | 3.8528 |
| Random | 11, 10, 1 | 3.8570 |
| Random | 12, 7, 2 | 3.8730 |
| Min. triplet | 15, 5, 7 | 2.4758 |

To demonstrate the greedy and random selection algorithms, we used the same data. Table 1 includes the results. The entropy of the maximal triplet of sensors can be compared to that of randomly selected triplets. Note that since many sensors are spread in a relatively small area, there are several triplets whose amount of information is very close to the maximal (for a triplet). To get a sense of how correlated sensors can be, the entropy of a minimal triplet (also found by a greedy algorithm) is depicted as well.
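A sketch of greedy selection driven by the empirical joint entropy follows; the function names and the use of a first-order (per-tuple) empirical distribution are illustrative assumptions:

```python
import numpy as np

def empirical_joint_entropy(data):
    """First-order empirical joint entropy (in bits) of the rows of `data`,
    where each row is one joint observation of the selected sensors."""
    _, counts = np.unique(data, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def greedy_select(data, k):
    """Greedily pick k sensor indices (columns of `data`) maximizing the
    joint empirical entropy; by submodularity of entropy, such greedy
    selection enjoys a (1 - 1/e) approximation guarantee."""
    chosen = []
    remaining = list(range(data.shape[1]))
    for _ in range(k):
        best = max(remaining,
                   key=lambda j: empirical_joint_entropy(data[:, chosen + [j]]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

A duplicated sensor contributes zero marginal entropy, so the greedy rule naturally skips it in favor of independent sensors.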

We also demonstrate the random sensor selection algorithm on artificial data. To do this, we artificially created randomized data for 5 independent sensors, and used them to create 5 additional dependent ones, each a function of the original sensors. Sensors with even numbers are independent of each other, while sensors with odd numbers are linearly dependent on the even-numbered sensors. Note that this is a very simplified model, included here only to demonstrate in practice the number of rounds the random selection algorithm requires in order to find an independent set. Furthermore, note that sensors depending on others and additional data may still be independent of each other, depending on the other sensors in the group. For example, if $X$ and $Y$ are independent bits (with entropy $1$ each), and $Z = X \oplus Y$, then $X$ and $Z$ are still independent, with joint entropy $2$, while the three are dependent, with joint entropy $2$ as well.
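The XOR example above can be verified by direct enumeration; this short check computes the exact entropies over the four equally likely $(X, Y)$ pairs:

```python
import itertools
import math
from collections import Counter

# Enumerate the 4 equally likely (X, Y) pairs and set Z = X xor Y.
triples = [(x, y, x ^ y) for x, y in itertools.product((0, 1), repeat=2)]

def entropy(values):
    """Shannon entropy (in bits) of a uniform sample of outcomes."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

h_xz = entropy([(x, z) for x, _, z in triples])    # joint entropy of (X, Z)
h_xyz = entropy(triples)                           # joint entropy of (X, Y, Z)
```

Both joint entropies equal $2$ bits: the pair $(X, Z)$ carries two full bits, yet adding $Y$ contributes nothing, since $Y = X \oplus Z$.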

The algorithm then chose sets of $5$ sensors at random. Entropy estimates of the selected sensors are computed according to the joint first-order empirical probability estimate, that is, $\hat{H} = -\sum_{x} \hat{p}(x) \log \hat{p}(x)$, where $\hat{p}$ is the empirical distribution of the data for the five selected sensors. It is easy to see from Table 2 that independent sensors were drawn very quickly, with $4$ out of $20$ trials succeeding.
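The random selection procedure can be sketched as below; the function name, the entropy threshold, and the stopping rule are illustrative assumptions:

```python
import numpy as np

def random_basis_search(data, k, threshold, max_draws, seed=0):
    """Draw size-k sensor subsets uniformly at random until the empirical
    joint entropy exceeds `threshold`, mimicking random basis sampling.
    Returns the successful subset (sorted) and the number of draws used,
    or (None, max_draws) if no subset passed the threshold."""
    rng = np.random.default_rng(seed)
    for draw in range(1, max_draws + 1):
        subset = rng.choice(data.shape[1], size=k, replace=False)
        _, counts = np.unique(data[:, subset], axis=0, return_counts=True)
        p = counts / counts.sum()
        h = float(-(p * np.log2(p)).sum())
        if h >= threshold:
            return sorted(subset.tolist()), draw
    return None, max_draws
```

When independent subsets form a constant fraction of all size-$k$ subsets, as the matroid structure guarantees for bases, only a constant expected number of draws is needed.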

| Draw Number | Entropy Estimate | Draw Number | Entropy Estimate |
|---|---|---|---|
| 1 | 3.9938 | 11 | 3.9899 |
| 2 | 3.9938 | 12 | 2.9966 |
| 3 | 3.9938 | 13 | 4.9829 |
| 4 | 3.9915 | 14 | 3.9938 |
| 5 | 3.9938 | 15 | 3.9938 |
| 6 | 2.9970 | 16 | 4.9829 |
| 7 | 1.9976 | 17 | 4.9829 |
| 8 | 4.9829 | 18 | 2.9966 |
| 9 | 3.9895 | 19 | 2.9943 |
| 10 | 3.9938 | 20 | 3.9938 |

## Appendix A Proof of Proposition 6

We follow the analysis of exponential weighting, similar to Merhav et al. (2002). A similar analysis was also used in Cohen et al. (2007). In our setting, however, there is no notion of block size (so one can assume data is processed in blocks of size $1$).

For some $\eta > 0$ define

$$W_t = \sum_{s \in \hat{F}} e^{-\eta L_t(s)},$$

and let the probability distribution assigned by the algorithm be

$$P_t(s) = \frac{e^{-\eta L_{t-1}(s)}}{W_{t-1}}. \qquad (5)$$

We have

$$\ln \frac{W_T}{W_0} = \ln \sum_{s \in \hat{F}} e^{-\eta L_T(s)} - \ln M \ge -\eta \min_{s \in \hat{F}} L_T(s) - \ln M. \qquad (6)$$

Moreover,

$$\ln \frac{W_t}{W_{t-1}} = \ln \sum_{s \in \hat{F}} P_t(s) e^{-\eta l_t(s)} \le -\eta \sum_{s \in \hat{F}} P_t(s) l_t(s) + \frac{\eta^2 B^2}{8}, \qquad (7)$$

where $l_t(s)$ denotes the instantaneous loss of sensor $s$ at time $t$, and the last inequality follows from assuming the distance function is bounded by some $B$, hence $l_t(s)$ is in the range $[0, B]$, together with the extension to Hoeffding's inequality given in Merhav et al. (2002), which asserts that for any random variable $X$ taking values in a bounded interval of size $B$ and with mean $\bar{X}$ we have

$$\ln E\left[e^{\lambda X}\right] \le \lambda \bar{X} + \frac{\lambda^2 B^2}{8}.$$

Thus,

$$\ln \frac{W_T}{W_0} = \sum_{t=1}^{T} \ln \frac{W_t}{W_{t-1}} \le -\eta \, E[L_{alg}] + \frac{T \eta^2 B^2}{8}, \qquad (8)$$

where the expectation in (8) is with respect to the randomized choices the algorithm makes, that is, $E[L_{alg}] = \sum_{t=1}^{T} \sum_{s \in \hat{F}} P_t(s) l_t(s)$. Finally, from (6) and (8), we have, for any sequence,

$$E[L_{alg}] \le \min_{s \in \hat{F}} L_T(s) + \frac{\ln M}{\eta} + \frac{T \eta B^2}{8}. \qquad (9)$$

Since $\eta$ is any non-negative parameter, we may optimize the right hand side of (9) with respect to $\eta$. The proposition follows by choosing

$$\eta = \sqrt{\frac{8 \ln M}{T B^2}},$$

which, after normalizing by $T$, yields $E\bar{L}_{alg} \le \min_{s \in \hat{F}} \bar{L}_T(s) + B\sqrt{\ln M / (2T)}$.

## References

- Slepian and Wolf (1973) D. Slepian, J. Wolf, Noiseless coding of correlated information sources, IEEE Trans. Inform. Theory 19 (1973) 471–480.
- Ziv and Lempel (1978) J. Ziv, A. Lempel, Compression of individual sequences via variable-rate coding, IEEE Trans. Inform. Theory IT-24 (1978) 530–536.
- Zozor et al. (2005) S. Zozor, P. Ravier, O. Buttelli, On Lempel-Ziv complexity for multidimensional data analysis, Physica A: Statistical Mechanics and its Applications 345 (2005) 285–302.
- Blanc et al. (2008) J. Blanc, N. Schmidt, L. Bonnier, L. Pezard, A. Lesne, Quantifying neural correlations using Lempel-Ziv complexity, in: Neurocomp.
- Hall and Llinas (1997) D. Hall, J. Llinas, An introduction to multisensor data fusion, Proceedings of the IEEE 85 (1997) 6–23.
- Sasiadek (2002) J. Sasiadek, Sensor fusion, Annual Reviews in Control 26 (2002) 203–228.
- Waltz (1986) E. Waltz, Data fusion for C3I: A tutorial, Command, Control, Communications Intelligence (C3I) Handbook (1986) 217–226.
- Karger (1993) D. Karger, Random sampling in matroids, with applications to graph connectivity and minimum spanning trees, in: Foundations of Computer Science, 1993. Proceedings., 34th Annual Symposium on, IEEE, pp. 84–93.
- Berger et al. (2004) F. Berger, P. Gritzmann, S. de Vries, Minimum cycle bases for network graphs, Algorithmica 40 (2004) 51–62.
- Vovk (1990) V. G. Vovk, Aggregating strategies, Proc. 3rd Annu. Workshop Computational Learning Theory, San Mateo, CA (1990) 372–383.
- Jeon and Landgrebe (1999) B. Jeon, D. Landgrebe, Decision fusion approach for multitemporal classification, Geoscience and Remote Sensing, IEEE Transactions on 37 (1999) 1227–1233.
- Yu and Sycara (2006) B. Yu, K. Sycara, Learning the quality of sensor data in distributed decision fusion, in: Information Fusion, 9th International Conference on, IEEE, pp. 1–8.
- Polikar et al. (2006) R. Polikar, D. Parikh, S. Mandayam, Multiple classifier systems for multisensor data fusion, in: Sensors Applications Symposium, 2006. Proceedings of the 2006 IEEE, pp. 180–184.
- Oxley (1992) J. G. Oxley, Matroid Theory, Oxford Univ. Press, Oxford, U.K., 1992.
- Yeung (2002) R. W. Yeung, A First Course in Information Theory, Springer, 2002.
- Chan and Grant (2008) T. H. Chan, A. Grant, Dualities between entropy functions and network codes, IEEE Trans. Inform. Theory 54 (2008) 4470–4487.
- Matúš (2007) F. Matúš, Two constructions on limits of entropy functions, IEEE Trans. Inform. Theory 53 (2007) 320–330.
- Cover and Thomas (2006) T. Cover, J. Thomas, Elements of information theory, Wiley, 2006.
- Savari (1997) S. Savari, Redundancy of the Lempel-Ziv incremental parsing rule, IEEE Trans. Inform. Theory 43 (1997) 9–21.
- Lezaud (1998) P. Lezaud, Chernoff-type bound for finite Markov chains, The Annals of Applied Probability 8 (1998) 849–867.
- Shamaiah et al. (2010) M. Shamaiah, S. Banerjee, H. Vikalo, Greedy sensor selection: Leveraging submodularity, in: Decision and Control (CDC), 49th IEEE Conference on, pp. 2572–2577.
- Nemhauser and Wolsey (1978) G. Nemhauser, L. Wolsey, Best algorithms for approximating the maximum of a submodular set function, Mathematics of Operations Research (1978) 177–188.
- Littlestone and Warmuth (1994) N. Littlestone, M. K. Warmuth, The weighted majority algorithm, Inform. Comput. 108 (1994) 212–261.
- Cesa-Bianchi et al. (1997) N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, M. K. Warmuth, How to use expert advice, Journal of the ACM 44(3) (1997) 427–485.
- Haussler et al. (1998) D. Haussler, J. Kivinen, M. K. Warmuth, Sequential prediction of individual sequences under general loss functions, IEEE Trans. on Information Theory 44 (1998) 1906–1925.
- Cesa-Bianchi and Lugosi (1999) N. Cesa-Bianchi, G. Lugosi, On prediction of individual sequences, The Annals of Statistics 27 (1999) 1865–1895.
- Gyorfi et al. (1999) L. Gyorfi, G. Lugosi, G. Morvai, A simple randomized algorithm for sequential prediction of ergodic time series, IEEE Trans. Inform. Theory 45 (1999) 2642–2650.
- Vovk (1998) V. Vovk, A game of prediction with expert advice, Journal of Computer and System Sciences 56 (1998) 153–173.
- Linder and Lugosi (2001) T. Linder, G. Lugosi, A zero-delay sequential scheme for lossy coding of individual sequences, IEEE Trans. Inform. Theory 47 (2001) 2533–2538.
- Weissman and Merhav (2002) T. Weissman, N. Merhav, On limited-delay lossy coding and filtering of individual sequences, IEEE Trans. Inform. Theory 48 (2002) 721–733.
- Gyorgy et al. (2004) A. Gyorgy, T. Linder, G. Lugosi, Efficient adaptive algorithms and minimax bounds for zero-delay lossy source coding, IEEE Trans. Signal Processing 52 (2004) 2337–2347.
- Merhav et al. (2002) N. Merhav, E. Ordentlich, G. Seroussi, M. J. Weinberger, On sequential strategies for loss functions with memory, IEEE Trans. Inform. Theory 48 (2002) 1947–1958.
- Cohen et al. (2007) A. Cohen, N. Merhav, T. Weissman, Scanning and sequential decision making for multi-dimensional data - part I: the noiseless case, IEEE Trans. Inform. Theory 53 (2007) 3001–3020.
- Cohen et al. (2008) A. Cohen, T. Weissman, N. Merhav, Scanning and Sequential Decision Making for Multidimensional Data - Part II: The Noisy Case, IEEE Transactions on Information Theory 54 (2008) 5609–5631.