This work was partially supported by the German Research Foundation (DFG) within the Priority Program “Algorithms for Big Data” (SPP 1736) and by the Federal Ministry of Education and Research (BMBF) as part of the project “Resilience by Spontaneous Volunteers Networks for Coping with Emergencies and Disaster” (RESIBES), (grant no. 13N13955 to 13N13957).

# Monitoring of Domain-Related Problems in Distributed Data Streams

## Abstract

Consider a network in which $n$ distributed nodes are connected to a single server. Each node continuously observes a data stream consisting of one value per discrete time step. The server has to continuously monitor a given parameter defined over all information available at the distributed nodes. That is, in any time step $t$, it has to compute an output based on all values currently observed across all streams. To do so, nodes can send messages to the server and the server can broadcast messages to the nodes. The objective is the minimisation of communication while allowing the server to compute the desired output.

We consider monitoring problems related to the domain $D_t$, defined to be the set of values observed by at least one node at time $t$. We provide randomised algorithms for monitoring $D_t$, (approximations of) the size $|D_t|$ and the frequencies of all members of $D_t$. Besides worst-case bounds, we also obtain improved results when inputs are parameterised according to the similarity of observations between consecutive time steps. This parameterisation allows us to exclude inputs with rapid and heavy changes, which usually lead to the worst-case bounds but might be rather artificial in certain scenarios.


## 1 Introduction

Consider a system consisting of a huge number of nodes, such as a distributed sensor network. Each node continuously observes its environment and measures information such as temperature, pollution or similar parameters. Given such a system, we are interested in aggregating information and continuously monitoring properties describing the current status of the system at a central server. To keep the server’s information up to date, the server and the nodes can communicate with each other. In sensor networks, however, the amount of such communication is particularly crucial, as communication translates to energy consumption, which determines the overall lifetime of the network due to limited battery capacities. Therefore, algorithms aim at minimising the communication required for monitoring the respective parameter at the server.

One very basic parameter is the domain of the system, defined to be the set of values currently observed across all nodes. We consider different notions related to the domain and propose algorithms for monitoring the domain itself, (approximations of) its size and (approximations of) the frequencies of the values comprising the domain, respectively. Each of these parameters can provide useful information. For example, knowing the (approximate) frequency of each value allows one to approximate the histogram of the observed values very precisely, which in turn allows one to determine (approximations of) several functions of the input, e.g. heavy hitters, quantiles, top-$k$, frequency moments or threshold problems.

### 1.1 Model and Problems

We consider the continuous distributed monitoring setting, introduced by Cormode, Muthukrishnan, and Yi in [1], in which there are $n$ distributed nodes, each uniquely identified by an identifier (ID) from the set $\{1, \dots, n\}$, connected to a single server. Each node observes a stream of values over time, and at any discrete time step $t$ node $i$ observes one value $v_i^t$. The server is asked to compute, at any point $t$ in time, an output $f(t)$ which depends on the values $v_i^{t'}$ (for $t' \le t$ and $i = 1, \dots, n$) observed across all distributed streams up to the current time step $t$. The exact definition of $f(t)$ depends on the concrete problem under consideration; the problems are defined in the section below.

For the solution of these problems, we are usually interested in approximation algorithms. An $\varepsilon$-approximation of $f(t)$ is an output $\tilde f(t)$ of the server such that $(1-\varepsilon) f(t) \le \tilde f(t) \le (1+\varepsilon) f(t)$. We call an algorithm that, for each time step, provides an $\varepsilon$-approximation with probability at least $1-\delta$ an $(\varepsilon, \delta)$-approximation algorithm.

To be able to compute the output, the nodes and the server can communicate with each other by exchanging single cast messages or by broadcast messages sent by the server and received by all nodes. Both types of communication are instantaneous and have unit cost per message. That is, sending a single message to one specific node incurs a cost of one, and so does one broadcast message. Each message is of limited size and will usually, besides a constant number of control bits, consist of an observed value, a node ID and an identifier to distinguish between messages of different instances of an algorithm applied in parallel (as done when using standard probability amplification techniques). Having a broadcast channel is an extension to [1], which was originally proposed in [2] and afterwards applied in [7, 8]. For ease of presentation, we assume that not only the server can send broadcast messages, but also the nodes. This changes the communication cost only by a factor of at most two, as a broadcast by a node can always be implemented by a single cast message followed by a broadcast of the server. Between any two time steps we allow a communication protocol to take place, which may use a polylogarithmic number of rounds, i.e. $O(\log^c n)$ rounds for some constant $c$. The optimisation goal is the minimisation of the communication cost, given by the number of exchanged messages, required to monitor the considered problem.

#### Monitoring of Domain-Related Functions.

In this paper, we consider the monitoring of different problems related to the domain of the network. The domain $D_t$ at time $t$ is defined as the set of values observed by at least one node at time $t$. We study the following three problems related to the domain:

• Domain Monitoring. At any point in time, the server needs to know the domain of the system as well as a representative node for each value of the domain. Formally, monitor $D_t$ at any point $t$ in time. Also, maintain a sequence of nodes such that, for every observed value $v \in D_t$, a representative is determined that observes $v$ at time $t$. For each value which is not observed, no representative is given.

• Frequency Monitoring. For each $v \in D_t$, monitor the frequency $|N_t^v|$, where $N_t^v$ denotes the set of nodes that observe $v$ at time $t$, i.e. the number of nodes currently observing $v$.

• Count Distinct Monitoring. Monitor $|D_t|$, i.e. the number of distinct values observed at time $t$.

We provide an exact algorithm for the Domain Monitoring Problem and $(\varepsilon, \delta)$-approximations for the Frequency and Count Distinct Monitoring Problems.

### 1.2 Our Contribution

For the Domain Monitoring Problem, an algorithm which uses $\Theta\left(\sum_{t=1}^{T} |D_t|\right)$ messages on expectation for $T$ time steps is given in Section 2. This is asymptotically optimal in the worst case, in which $D_t \cap D_{t+1} = \emptyset$ holds for all $t$. We also provide an algorithm and an analysis based on the minimum possible number $R^*$ of changes of representatives for a given input. It exploits situations where $D_t \cap D_{t+1} \neq \emptyset$ and uses $O(\log n \cdot R^*)$ messages on expectation.

For an $(\varepsilon, \delta)$-approximation of the Frequency Monitoring Problem for $T$ time steps, we first provide an algorithm using $O\left(\sum_{t=1}^{T} |D_t| \cdot \frac{1}{\varepsilon^2} \log \frac{|D_t|}{\delta}\right)$ messages on expectation in Section 3. We then improve this bound for instances in which observations between consecutive steps have a certain similarity. That is, for inputs fulfilling the property that, for all values $v$ and some $\sigma$, the number of nodes observing $v$ does not change by a factor larger than $\sigma$ between consecutive time steps, we provide an algorithm with a smaller expected number of messages. In the section on the Count Distinct Monitoring Problem, we provide an algorithm for monitoring this problem for $T$ time steps and bound its expected number of messages. For instances which exhibit a certain similarity, a further algorithm is presented and analysed.

### 1.3 Related Work

The basis of the model considered in this paper is the continuous monitoring model as introduced by Cormode, Muthukrishnan and Yi in [1]. In this model, there is a set of distributed nodes, each observing a stream given by a multiset of items in each time step. The nodes can communicate with a central server, which in turn has the task to continuously, at any time $t$, compute a function $f$ defined over all data observed across all streams up to time $t$. The goal is to design protocols aiming at the minimisation of the number of bits communicated between the nodes and the server. In [1], the monitoring of several functions is studied in their (approximate) threshold variants, in which the server has to output $1$ if $f \ge \tau$ and $0$ if $f \le (1-\varepsilon)\tau$, for given $\tau$ and $\varepsilon$. Precisely, algorithms for the frequency moments $F_p = \sum_i m_i^p$, where $m_i$ denotes the frequency of item $i$, are given for $p = 0, 1, 2$. $F_1$ represents the simple sum of all items received so far and $F_0$ the number of distinct items received so far.

Since the introduction of the model, the monitoring of several functions has been studied, such as the monitoring of frequencies and ranks by Huang, Yi and Zhang in [5]. The frequency of an item $i$ is defined to be the number of occurrences of $i$ across all streams up to the current time. The rank of an item $i$ is the number of items smaller than $i$ observed in the streams. Further frequency moments are considered by Woodruff and Zhang in [9]. A variant of the Count Distinct Monitoring Problem is considered by Gibbons and Tirthapura in [4]. The authors study a model in which each of two nodes receives a stream of items, and at the end of the streams a server is asked to compute the number of distinct items based on both streams. A main technical ingredient is the use of so-called public coins, which, once initialised at the nodes, provide a way to let different nodes observe identical outcomes of random experiments without further communication. We will adopt this technique for the Count Distinct Monitoring Problem.

Note that the previously mentioned problems are all defined over the items received so far, which is in contrast to the monitoring problems we consider, all of which are defined based only on the current time step. As a consequence, the functions monitored in our problems are no longer monotone, which makes their monitoring more complicated.

Concerning monitoring problems in which the function tracked by the server only depends on the current time step, there is also some previous work to mention. In [6], Lam, Liu and Ting study a setting in which the server needs to know, at any time, the order type of the values currently observed. That is, the server needs to know which node observes the largest value, the second largest value and so on at time $t$. In [10], Yi and Zhang consider a system consisting of only one node connected to the server. The node continuously observes a $d$-dimensional vector of integers. The goal is to keep the server informed about this vector up to some additive error per component. In [3], Davis, Edmonds and Impagliazzo consider the following resource allocation problem: $n$ nodes observe streams of required shares of a given resource. The server has to assign to each node, in each time step, a share of the resource that is at least as large as the required share. The objective is then the minimisation of the communication necessary for adapting the assignment of the resource over time.

## 2 The Domain Monitoring Problem

We start by presenting an algorithm to solve the Domain Monitoring Problem for a single time step. We analyse the communication cost using standard worst-case analysis and show tight bounds. By applying the algorithm in each time step, we then obtain tight bounds for monitoring the domain for any $T$ time steps. The basic idea of the protocol ConstantResponse is quite simple: applied at a time $t$ with a value $v$, it informs the server whether $v \in D_t$ holds or not. To do so, each node $i$ observing $v$ essentially draws a value from a geometric distribution, and then those nodes having drawn the largest such value send broadcast messages. One can show that, this way, only a constant number of messages is sent on expectation.

Furthermore, if applied without a specific value, the server can decide whether $v \in D_t$ for all values $v$ at once with $\Theta(|D_t|)$ messages on expectation. To this end, for each $v \in D_t$ independently, the nodes observing $v$ that draw the largest value from the geometric distribution send broadcast messages. In the presentation of ConstantResponse, we assume that the condition on the observed value is always true if no specific value is given. Also, in order to apply the protocol to a subset of nodes, we assume that each node maintains a flag and that only nodes whose flag is set take part in the protocol.
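The following is a minimal simulation sketch of the protocol idea described above, not the paper's pseudocode. It assumes that heights are capped at $\lceil \log n \rceil$, that rounds count down from the largest possible height, and that all nodes holding the current round's height broadcast in the same round; the names `draw_height` and `constant_response` are illustrative.

```python
import math
import random


def draw_height(max_height, rng):
    """Geometric height: value h with probability 2^-h, capped at max_height."""
    h = 1
    while h < max_height and rng.random() < 0.5:
        h += 1
    return h


def constant_response(observers, n, rng):
    """Simulate one run among `observers`, the nodes currently observing v.

    Returns (representative, messages). Rounds count down from the largest
    possible height. Nodes whose height equals the current round broadcast,
    and all remaining nodes stay silent once a broadcast has been heard.
    """
    if not observers:
        return None, 0  # nobody observes v, so the server hears nothing
    max_height = max(1, math.ceil(math.log2(n)))
    heights = {i: draw_height(max_height, rng) for i in observers}
    for r in range(max_height, 0, -1):
        senders = [i for i in observers if heights[i] == r]
        if senders:
            return senders[0], len(senders)
    raise AssertionError("unreachable: every height lies in 1..max_height")
```

Averaging the returned message count over many runs gives an empirical view of the constant expected cost, independent of how many nodes observe the value.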

We have the following lemma, which bounds the expected communication cost of \crefalg:p1 and has already appeared in a similar way in [8] (Lemma III.1).

###### Lemma 2.1.

Applied for a fixed time $t$, ConstantResponse uses $O(1)$ messages on expectation if applied with a specific value $v$, and $\Theta(|D_t|)$ messages on expectation otherwise.

###### Proof.

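First consider the case where the protocol is applied with a specific value $v$. Regarding the expected communication of ConstantResponse, we introduce some notation. Let $N_t^v$ be the set of nodes observing $v$ at time $t$ and let $n_v = |N_t^v|$. Let $h_i$ denote the height drawn by node $i$. Let $X_i$ be a $\{0,1\}$-random variable indicating whether node $i$ sends a message to the server, and let $X := \sum_{i \in N_t^v} X_i$. According to the algorithm, a node sends a message if and only if its height matches the round specified for that height and no other node has sent its value beforehand. We obtain

$$\Pr[X_i = 1] = \Pr\left[\exists r \in \{1, \dots, \log n\} : h_i = r \wedge \forall i' \in N_t^v \setminus \{i\} : h_{i'} \le r\right] \le \sum_{r=1}^{\log n} \frac{1}{2^r}\left(1 - \frac{1}{2^r}\right)^{n_v - 1}.$$

By linearity of expectation, $E[X] = \sum_{i \in N_t^v} \Pr[X_i = 1]$, and thus

$$E[X] \le n_v \cdot \sum_{r=1}^{\log n} \frac{1}{2^r}\left(1 - \frac{1}{2^r}\right)^{n_v - 1}.$$

Observing that the function $r \mapsto \frac{n_v}{2^r}\left(1 - \frac{1}{2^r}\right)^{n_v - 1}$ has only one extreme point and takes values in $[0, 1]$ for all $r \ge 0$, we use the integral test for convergence to obtain

$$E[X] \le n_v \cdot \sum_{r=1}^{\log n} \frac{1}{2^r}\left(1 - \frac{1}{2^r}\right)^{n_v - 1} \le n_v \int_{0}^{\log n} \frac{1}{2^r}\left(1 - \frac{1}{2^r}\right)^{n_v - 1} dr + 2 \le \left[\frac{1}{\ln(2)}\left(1 - \frac{1}{2^r}\right)^{n_v}\right]_{0}^{\log n} + 2 \le \frac{1}{\ln(2)} + 2 < 4.$$

For the case in which the protocol is applied to all values at once, we can apply the same argumentation independently for each value $v \in D_t$. This concludes the proof of the lemma. ∎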
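As a numerical sanity check of the bound on $E[X]$, the sum $n_v \sum_{r=1}^{\log n} 2^{-r}(1 - 2^{-r})^{n_v - 1}$ can be evaluated directly. The sketch below assumes $\lceil \log_2 n \rceil$ rounds; the function name is illustrative.

```python
import math


def expected_messages_bound(n_v, n):
    """Evaluate n_v * sum_{r=1}^{ceil(log2 n)} 2^-r * (1 - 2^-r)^(n_v - 1)."""
    rounds = max(1, math.ceil(math.log2(n)))
    return n_v * sum(
        2.0 ** -r * (1.0 - 2.0 ** -r) ** (n_v - 1) for r in range(1, rounds + 1)
    )


if __name__ == "__main__":
    for n_v in (1, 10, 1000, 100000):
        print(n_v, expected_messages_bound(n_v, 2 ** 20))
```

For every combination tried here the value stays well below the constant 4 derived in the proof.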

In order to solve the Domain Monitoring Problem for $T$ time steps, the server proceeds as follows: In each step $t$ the server calls ConstantResponse to identify all values belonging to $D_t$ as well as a valid sequence of representatives. By the previous lemma we then have an overall communication cost of $\Theta(|D_t|)$ for each time step $t$. For monitoring $T$ time steps, the cost is $\Theta\left(\sum_{t=1}^{T} |D_t|\right)$. This is asymptotically optimal in the worst case, since on instances where $D_t \cap D_{t+1} = \emptyset$ for all $t$, any algorithm has cost $\Omega\left(\sum_{t=1}^{T} |D_t|\right)$.

###### Theorem 2.2.

Using ConstantResponse, the Domain Monitoring Problem for $T$ time steps can be solved using $\Theta\left(\sum_{t=1}^{T} |D_t|\right)$ messages on expectation.

### A Parameterised Analysis

Despite the optimality of the result, the strategy of computing a new solution from scratch in each time step seems unwise, and the analysis does not seem to capture the essence of the problem properly. It might often be the case that there are similarities between the values observed in consecutive time steps and, in particular, that $D_t \cap D_{t+1} \neq \emptyset$. In this case, it may be possible to keep a representative for several consecutive time steps, which should be exploited. Motivated by these observations, we next define a parameter describing this behaviour and provide a parameterised analysis. To this end, we consider the number of component-wise differences between the sequences of representatives of two consecutive time steps and call this difference the number of changes of representatives in time step $t$. Let $R^*$ denote the minimum possible number of changes of representatives (over all considered time steps).

The formal description of our algorithm is given in DomainMonitoring. Roughly speaking, the algorithm defines, for each value $v$, phases, where a phase is defined as a maximal time interval during which there exists one node observing value $v$ throughout the entire interval. Whenever a node being a representative for $v$ changes its observation, it informs the server so that a new representative can be chosen from those nodes that have observed $v$ throughout the entire phase (which each such node indicates by a locally maintained flag). If no new representative is found this way, the server ends the current phase and tries to find a new representative among all nodes currently observing $v$. Additionally, if nodes observe a value $v$ at time $t$ that had no representative before, a new representative is determined among these nodes. Note that this requires each node to store the domain at any time and hence additional storage at the nodes.

###### Theorem 2.3.

DomainMonitoring solves the Domain Monitoring Problem using $O(\log n \cdot R^*)$ messages on expectation, where $R^*$ denotes the minimum possible number of changes of representatives.

###### Proof.

We consider each value separately. Let denote the set of nodes that observe the value at each point in time with . Consider a fixed phase for and let and be the points in time where the phase starts and ends, respectively. A phase only ends in \crefstep:removed, hence there was no response from ConstantResponse, which implies . Thus, to each phase for we can associate a cost of at least one to and this holds for each . Therefore, is at least the overall number of phases of all values.

Next we analyze the expected cost of \crefalg:domainMonitoring during the considered phase for . Let w.l.o.g. . With respect to the fixed phase, only nodes in can communicate and the communication is bounded by the number of changes of the representative for during the phase. Let be the first time after at which node does not observe . Let the nodes be sorted such that implies . Let be the nodes \crefalg:domainMonitoring chooses as representatives in the considered phase. We want to show that . To this end, partition the set of time steps into groups . Intuitively, represents the time steps in which the nodes continuously observe value since time and the size of the initial set of nodes that observed is halved times. Formally, contains all time steps (where for convenience) such that is the largest integer fulfilling .

Let be the number of changes of representatives in time steps belonging to . We have . Consider a fixed . Let be the event that the -th representative chosen in time steps belonging to is the first one with an index in . Observe that as soon as this happens, the respective representative will be the last one chosen in a time step belonging to group .

Now, since the algorithm chooses a new representative uniformly at random from the index set , the probability that it chooses a representative from is at least except for the first representative of , where it might be slightly smaller due to rounding errors. occurs only if the first representatives were each not chosen from this set, i.e. . Hence, . ∎
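To illustrate why choosing representatives uniformly at random leads to only logarithmically many changes within a phase, the sketch below simulates a single phase under simplifying assumptions: every node has a fixed (hypothetical) time at which it stops observing the value, and a new representative is drawn uniformly from the nodes that are still observing it. The function name and the input format are illustrative and not taken from the paper.

```python
import random


def representatives_in_phase(leave_times, rng):
    """Count representatives chosen until no node observes the value any more.

    leave_times[i] is the first time step at which node i no longer observes
    the value. Each representative is drawn uniformly at random from the
    nodes still observing the value when the previous one leaves.
    """
    alive = list(range(len(leave_times)))
    chosen = 0
    while alive:
        rep = rng.choice(alive)
        chosen += 1
        # Nodes leaving no later than the representative are gone by the
        # time the representative itself leaves.
        alive = [i for i in alive if leave_times[i] > leave_times[rep]]
    return chosen
```

In this simplified model with distinct leave times, the expected count equals the harmonic number $H_m$ for $m$ initial observers, which is consistent with the $O(\log n)$ behaviour argued in the proof.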

## 3 The Frequency Monitoring Problem

In this section we design and analyse an algorithm for the Frequency Monitoring Problem, i.e. to output (an approximation of) the number of nodes currently observing value $v$. We start by considering a single time step and present an algorithm which solves the subproblem of outputting the number of nodes that observe $v$ within a constant multiplicative error bound. Afterwards, based on this subproblem, a simple sampling algorithm is presented which solves the Frequency Monitoring Problem for a single time step up to a given (multiplicative) error bound and with the demanded error probability.

While in the previous section we used the algorithm ConstantResponse with the goal of obtaining a representative for a measured value, in this section we use the same algorithm to estimate the number of nodes that measure a certain value $v$. Observe that the expected maximal height of the geometric experiment increases with a growing number of nodes observing $v$. We exploit this fact to estimate the number $n_v$ of nodes with value $v$, while still expecting only constant communication cost. For a given time step $t$ and a value $v$, we define an algorithm ConstantFactorApproximation as follows: We apply ConstantResponse with value $v$, letting all nodes take part. If the server receives the first response in the communication round belonging to height $r$, the algorithm outputs $\tilde n_v^{const} = 2^r$ as the estimation for $n_v$.

We show that we compute a constant factor approximation with constant probability. Then we amplify this probability using multiple executions of the algorithm and taking the median (of the executions) as a final result.
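A minimal sketch of this estimator, assuming heights are geometric and capped at $\lceil \log_2 n \rceil$, and that the first response reveals the largest height drawn. The function name and the direct simulation of all draws are illustrative only.

```python
import math
import random


def constant_factor_estimate(n_v, n, rng):
    """Return 2^r, where r is the largest height drawn by the n_v observers."""
    if n_v == 0:
        return 0  # no node responds, so there is nothing to estimate
    max_height = max(1, math.ceil(math.log2(n)))
    best = 1
    for _ in range(n_v):
        h = 1
        while h < max_height and rng.random() < 0.5:
            h += 1
        best = max(best, h)
    return 2 ** best
```

Repeating the call with a fixed `n_v` and counting how often the result lies within a factor of 8 gives an empirical check of the constant success probability.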

###### Lemma 3.1.

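The algorithm ConstantFactorApproximation estimates the number $n_v$ of nodes observing the value $v$ at time $t$ up to a factor of $8$, i.e. its output $\tilde n_v^{const}$ satisfies $\frac{n_v}{8} \le \tilde n_v^{const} \le 8 n_v$, with constant probability.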

###### Proof.

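Let $n_v$ be the number of nodes currently observing value $v$, i.e. $n_v = |N_t^v|$. Recall that the probability for a single node $i$ to draw height $h$ is $\Pr[h_i = h] = \frac{1}{2^h}$ if $h < \log n$, and $\frac{1}{2^{h-1}}$ if $h = \log n$. Hence, $\Pr[h_i \ge h] = \frac{1}{2^{h-1}}$ for all $h \in \{1, \dots, \log n\}$.

We estimate the probability that the algorithm fails by analysing the cases in which the largest drawn height is larger than $\log n_v + 3$ or smaller than $\log n_v - 3$. We start with the first case; applying a union bound, we obtain

$$\Pr[\exists i : h_i > \log n_v + 3] \le \Pr[\exists i : h_i \ge \lceil \log n_v \rceil + 3] \le n_v \cdot \left(\frac{1}{2}\right)^{\lceil \log n_v \rceil + 2} \le \frac{1}{4}.$$

For the latter case, we bound the probability that each node has drawn a height strictly smaller than $\log n_v - 3$. For $n_v < 16$ this event is impossible, since every height is at least $1$. Otherwise, a single node draws such a height with probability at most $1 - \frac{8}{n_v}$, and since the nodes draw independently,

$$\Pr[\forall i : h_i < \log n_v - 3] \le \left(1 - \frac{8}{n_v}\right)^{n_v} \le e^{-8}.$$

Thus, the probability that we compute an $8$-approximation is bounded by

$$\Pr\left[\frac{n_v}{8} \le 2^{\max_i h_i} \le 8 n_v\right] = 1 - \left(\Pr[\exists i : h_i > \log n_v + 3] + \Pr[\forall i : h_i < \log n_v - 3]\right) > 0.7.$$

∎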

We apply an amplification technique to boost the success probability to an arbitrary $1 - \delta'$, using $\Theta\left(\log \frac{1}{\delta'}\right)$ parallel executions of the ConstantFactorApproximation algorithm and choosing the median of the intermediate results as the final output.

###### Corollary 3.2.

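Applying $\Theta\left(\log \frac{1}{\delta'}\right)$ independent, parallel instances of ConstantFactorApproximation, we obtain a constant factor approximation of $n_v$ with success probability at least $1 - \delta'$ using $O\left(\log \frac{1}{\delta'}\right)$ messages on expectation.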

###### Proof.

Choose $d = \frac{45}{2} \ln \frac{1}{\delta'}$ to be the number of copies of the algorithm and return the median of the intermediate results. Let $I_j$ be the indicator variable for the event that the $j$-th experiment does not result in an $8$-approximation. By Lemma 3.1 the failure probability can be upper bounded by a constant, i.e. $\Pr[I_j = 1] \le 0.3$. Hence, using a Chernoff bound, the probability that at least half of the experiments do not meet the required approximation factor of $8$ is

$$\Pr\left[\sum_{j=1}^{d} I_j \ge \frac{1}{2} d\right] \le \Pr\left[\sum_{j=1}^{d} I_j \ge \left(1 + \frac{2}{3}\right) \cdot 0.3 \cdot d\right] \le e^{-\left(\frac{2}{3}\right)^2 \cdot \frac{1}{3} \cdot 0.3 \cdot d} = e^{-\frac{2}{45} \cdot d} = e^{-\frac{2}{45} \cdot \frac{45}{2} \ln \frac{1}{\delta'}} = \delta'.$$

Observe that if more than half of the intermediate results are within the demanded error bound, so is the median. Thus, the algorithm produces an $8$-approximation of $n_v$ with success probability of at least $1 - \delta'$, concluding the proof. ∎
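The amplification step can be sketched as follows, using $d = \lceil \frac{45}{2} \ln \frac{1}{\delta'} \rceil$ copies as in the proof. The helper names are hypothetical, and `single_estimate` stands for any zero-argument function that runs one independent instance of the constant factor estimator.

```python
import math
import random
import statistics


def num_copies(delta_prime):
    """Number of parallel instances, d = ceil(45/2 * ln(1/delta'))."""
    return math.ceil(45 / 2 * math.log(1 / delta_prime))


def amplified_estimate(single_estimate, delta_prime):
    """Median of d independent runs of `single_estimate`.

    median_low is used so that the result is always one of the observed
    estimates rather than an average of two of them.
    """
    d = num_copies(delta_prime)
    return statistics.median_low(single_estimate() for _ in range(d))
```

Taking the lower median keeps the output equal to one actual run, so it lies within the error bound whenever more than half of the runs do.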

To obtain an $(\varepsilon, \delta)$-approximation, the algorithm EpsilonFactorApprox first applies the ConstantFactorApproximation algorithm to obtain a rough estimate $\tilde n_v^{const}$ of $n_v$. It is used to compute a probability $p$, which is broadcast to the nodes, so that every node observing value $v$ sends a message with probability $p$. Since the ConstantFactorApproximation result in the denominator of $p$ is close to $n_v$, the number of messages sent on expectation is independent of $n_v$. The estimated number of nodes observing $v$ is then given by the number of responding nodes divided by $p$, which, on expectation, results in $n_v$.
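The sampling step can be sketched as below. The form $p = \min\{1, \frac{24}{\varepsilon^2 \tilde n_v^{const}} \ln \frac{1}{\delta'}\}$ with $\delta' = \delta / 3$ is read off from the analysis of the algorithm; the capping at $1$, the function names and the direct simulation of the nodes' coin flips are assumptions of this sketch.

```python
import math
import random


def sampling_probability(rough_estimate, eps, delta):
    """p = min(1, 24 / (eps^2 * rough_estimate) * ln(1/delta')), delta' = delta/3."""
    delta_prime = delta / 3
    return min(1.0, 24 / (eps ** 2 * rough_estimate) * math.log(1 / delta_prime))


def epsilon_factor_estimate(n_v, rough_estimate, eps, delta, rng):
    """Each of the n_v observers responds with probability p.

    Returns (estimate, responses), where estimate = responses / p.
    """
    p = sampling_probability(rough_estimate, eps, delta)
    responses = sum(1 for _ in range(n_v) if rng.random() < p)
    return responses / p, responses
```

For a rough estimate close to the true count, the number of responses depends on $\varepsilon$ and $\delta$ only, which is what keeps the communication independent of $n_v$.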

###### Lemma 3.3.

The algorithm EpsilonFactorApprox provides an $(\varepsilon, \delta)$-approximation of $n_v$.

###### Proof.

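The algorithm obtains a constant factor approximation $\tilde n_v^{const}$ with probability at least $1 - \delta'$. Let $\bar n_v$ denote the number of responding nodes. The expected number of messages is $E[\bar n_v] = n_v \cdot p$.

We start by estimating the conditional probability that more than $(1+\varepsilon) n_v p$ responses are sent under the condition that $\tilde n_v^{const} \le 8 n_v$ and $p < 1$. In this case we have

$$p = \frac{24}{\varepsilon^2 \tilde n_v^{const}} \cdot \ln \frac{1}{\delta'} \ge \frac{3}{\varepsilon^2 n_v} \cdot \ln \frac{1}{\delta'},$$

hence using a Chernoff bound it follows

$$p_1 := \Pr\left[\bar n_v \ge (1+\varepsilon) n_v p \,\middle|\, \tilde n_v^{const} \le 8 n_v \wedge p < 1\right] \le e^{-\frac{\varepsilon^2}{3} n_v \cdot \frac{3}{\varepsilon^2 n_v} \cdot \ln \frac{1}{\delta'}} = \delta'.$$

Likewise, the probability that fewer than $(1-\varepsilon) n_v p$ messages are sent under the condition that $\tilde n_v^{const} \le 8 n_v$ and $p < 1$ is

$$p_2 := \Pr\left[\bar n_v \le (1-\varepsilon) n_v p \,\middle|\, \tilde n_v^{const} \le 8 n_v \wedge p < 1\right] \le e^{-\frac{\varepsilon^2}{2} n_v \cdot \frac{3}{\varepsilon^2 n_v} \cdot \ln \frac{1}{\delta'}} \le e^{-\frac{3}{2} \ln \frac{1}{\delta'}} < \delta'.$$

Next consider the case that $\tilde n_v^{const} > 8 n_v$ and $p < 1$ holds. Using

$$\Pr\left[\tilde n_v^{const} > 8 n_v\right] \le \Pr\left[\tilde n_v^{const} > 8 n_v \vee \tilde n_v^{const} < \tfrac{1}{8} n_v\right] \le \delta'$$

and for $\delta' = \frac{\delta}{3}$,

$$\Pr\left[(1-\varepsilon) n_v p < \bar n_v < (1+\varepsilon) n_v p \,\middle|\, p < 1\right] \ge 1 - \left(\Pr\left[\tilde n_v^{const} > 8 n_v\right] + (p_1 + p_2)\right) \ge 1 - 3\delta' = 1 - \delta.$$

For the last case $p = 1$, every node observing $v$ responds, so we have $\bar n_v = n_v$. Now, the claim directly follows. ∎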

###### Lemma 3.4.

Algorithm EpsilonFactorApprox uses $\Theta\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)$ messages on expectation.

###### Proof.

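Recall that each of the $n_v$ nodes sends a message with probability $p$, leading to $n_v \cdot p$ messages on expectation. First assume that the constant factor approximation was successful, i.e. in particular $\tilde n_v^{const} \ge \frac{1}{8} n_v$. If $p < 1$, we have

$$n_v \cdot p = n_v \cdot \frac{24}{\varepsilon^2 \tilde n_v^{const}} \cdot \ln \frac{1}{\delta'} \le \frac{24 \cdot 8}{\varepsilon^2} \cdot \ln \frac{1}{\delta'} = \Theta\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right).$$

If $p = 1$, by definition $\frac{24}{\varepsilon^2 \tilde n_v^{const}} \ln \frac{1}{\delta'} \ge 1$, hence $\tilde n_v^{const} \le \frac{24}{\varepsilon^2} \ln \frac{1}{\delta'}$. Thus, $n_v \cdot p = n_v \le 8 \tilde n_v^{const} = O\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)$.

For the case that the constant factor approximation was not successful, note that $\Pr\left[\tilde n_v^{const} < \frac{1}{8 \cdot 2^i} n_v\right] \le e^{-2^{i+3}}$ holds analogously to the calculation in Lemma 3.1. Also, for $\tilde n_v^{const} \ge \frac{1}{8 \cdot 2^i} n_v$ and $p < 1$, we have

$$n_v p \le 8 \cdot 2^i \cdot \tilde n_v^{const} \cdot \frac{24}{\varepsilon^2 \tilde n_v^{const}} \cdot \ln \frac{1}{\delta'} = 2^i \cdot \Theta\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right).$$

Similarly, for $p = 1$, we have $n_v p = n_v \le 8 \cdot 2^i \cdot \tilde n_v^{const} = 2^i \cdot O\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)$, as in this case $\tilde n_v^{const} \le \frac{24}{\varepsilon^2} \ln \frac{1}{\delta'}$. Hence, we can conclude

$$E[\bar n_v] \le \Theta\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right) \cdot \Pr\left[\tilde n_v^{const} \ge \frac{1}{8} n_v\right] + \sum_{i=0}^{\infty} \Pr\left[\frac{1}{8 \cdot 2^{i+1}} n_v \le \tilde n_v^{const} < \frac{1}{8 \cdot 2^i} n_v\right] \cdot 2^{i+1} \cdot \Theta\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)$$

$$\le \Theta\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)\left(1 + \sum_{i=0}^{\infty} \frac{2^{i+1}}{e^{2^{i+3}}}\right) \le \Theta\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)\left(1 + \sum_{i=0}^{\infty} 2^{i+1-2^{i+3}}\right) \le \Theta\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)\left(1 + \sum_{i=0}^{\infty} 2^{-i}\right) = \Theta\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right).$$

∎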

###### Theorem 3.5.

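There exists an algorithm that provides an $(\varepsilon, \delta)$-approximation for the Frequency Monitoring Problem for $T$ time steps with an expected number of $O\left(\sum_{t=1}^{T} |D_t| \cdot \frac{1}{\varepsilon^2} \log \frac{|D_t|}{\delta}\right)$ messages.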

###### Proof.

In every time step $t$ we first identify $D_t$ by applying ConstantResponse, using $\Theta(|D_t|)$ messages on expectation. For every value $v \in D_t$ we then run EpsilonFactorApprox with error probability $\frac{\delta}{|D_t|}$, resulting in $O\left(|D_t| \cdot \frac{1}{\varepsilon^2} \log \frac{|D_t|}{\delta}\right)$ messages on expectation for a single time step, while achieving a probability (using a union bound) of at least $1 - \delta$ that in one time step the estimations for every $v \in D_t$ are $\varepsilon$-approximations. Applied for each of the $T$ time steps, we obtain a bound as claimed. ∎

### A Parameterised Analysis

Applying EpsilonFactorApprox in every time step is a good solution in worst case scenarios. But if we assume that the change in the set of nodes observing a value is small in comparison to the size of the set, we can do better.

We extend EpsilonFactorApprox such that, in settings where only a small fraction of nodes change the value they measure from one time step to the next, the amount of communication can be reduced while the quality guarantees remain intact. We define $\sigma$ such that

$$\forall t : \sigma \ge \frac{|N_{t-1}^v \setminus N_t^v| + |N_t^v \setminus N_{t-1}^v|}{|N_t^v|}.$$

Note that this also implies that $D_t = D_{t-1}$ holds for all time steps $t$, i.e. the set of measured values stays the same over time.
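For a concrete input, the smallest admissible $\sigma$ for one value can be computed directly from the sequence of observer sets. The sketch below assumes the sets $N_1^v, N_2^v, \dots$ are given as Python sets of node IDs; the function name is illustrative.

```python
def similarity_parameter(observer_sets):
    """Smallest sigma satisfying the inequality for a sequence N_1, N_2, ...

    observer_sets[t] is the set of node IDs observing the value in step t.
    """
    sigma = 0.0
    for prev, cur in zip(observer_sets, observer_sets[1:]):
        if not cur:
            raise ValueError("the value must be observed in every time step")
        changed = len(prev - cur) + len(cur - prev)
        sigma = max(sigma, changed / len(cur))
    return sigma


if __name__ == "__main__":
    # One node leaves and one enters between steps 1 and 2: sigma = 2/4.
    print(similarity_parameter([{1, 2, 3, 4}, {2, 3, 4, 5}, {2, 3, 4, 5}]))
```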

The extension is designed so that, compared to EpsilonFactorApprox, the solution quality and the message complexity do not deteriorate asymptotically even in settings with many changes. The idea is the following: For a fixed value $v$, EpsilonFactorApprox is executed in a first time step (defining a probability $p$). In every following time step, for up to $T'$ consecutive time steps, nodes that start or stop measuring the value $v$ send a message to the server with the same probability $p$, while nodes that do not observe a change in their value remain silent. In every time step $t$, the server uses the accumulated messages from the first time step and all messages from nodes that started measuring $v$ in time steps $2, \dots, t$, while subtracting all messages from nodes that stopped measuring $v$ in time steps $2, \dots, t$. This accumulated message count is then used, similarly as in EpsilonFactorApprox, to estimate the total number of nodes observing $v$ in the current time step. The algorithm starts again if a) $T'$ time steps are over, so that the probability of a good estimation remains good enough, or b) the sum of estimated nodes that start or stop measuring value $v$ is too large. The latter is done to ensure that the message probability $p$ remains fitting to the number of nodes, ensuring a small amount of communication while guaranteeing an $(\varepsilon, \delta)$-approximation.
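The server-side bookkeeping described above can be sketched as follows. This is a simplified illustration under stated assumptions: the probability `p` is passed in rather than derived from a prior run of EpsilonFactorApprox, the restart criteria a) and b) are omitted, and the observer sets per time step are given explicitly. All names are hypothetical.

```python
import random


def continuous_estimates(observer_sets, p, rng):
    """Estimate the number of observers per time step from sampled reports.

    observer_sets[t] is the set of node IDs observing the value in step t.
    In the first step every observer reports with probability p; afterwards
    only entering and leaving nodes report, again with probability p.
    Returns (estimates, messages).
    """
    count = sum(1 for _ in observer_sets[0] if rng.random() < p)
    estimates = [count / p]
    messages = count
    for prev, cur in zip(observer_sets, observer_sets[1:]):
        entered = sum(1 for _ in cur - prev if rng.random() < p)
        left = sum(1 for _ in prev - cur if rng.random() < p)
        count += entered - left  # accumulated arrivals minus departures
        messages += entered + left
        estimates.append(count / p)
    return estimates, messages
```

With `p = 1.0` every change is reported and the estimates are exact, which makes the message count easy to verify by hand: the initial observers plus all entering and leaving nodes.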

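Let $n_t^+$ and $n_t^-$ be the number of nodes that start measuring $v$ in time step $t$ or that stop measuring it, respectively, i.e. $n_t^+ = |N_t^v \setminus N_{t-1}^v|$ and $n_t^- = |N_{t-1}^v \setminus N_t^v|$, and let $\bar n_t^+$ and $\bar n_t^-$ be the number of them that sent a message to the server in time step $t$. In the following we call nodes contributing to $n_t^+$ and $n_t^-$ entering and leaving, respectively.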

###### Lemma 3.6.

For any , the algorithm ContinuousEpsilonApprox provides an (,)-approximation of .

###### Proof.

By the same arguments as in Lemma 3.3, we obtain an $(\varepsilon, \delta)$-approximation of $n_1$, the number of nodes observing $v$ in the first time step. In any further time step we compute our estimate over the sum of all received messages (initial responses, arrivals and departures). If too many nodes change their measured value, we redo a complete estimation of the nodes observing $v$.

Recall that $\bar n_1$ is the random variable giving the estimated number of nodes by the algorithm, and $\bar n_t^+$ and $\bar n_t^-$ are the random variables giving the estimated arrivals and departures in time step $t$. We look at any time step $t$ in which the restart criteria are not met: Since $n_t = n_1 + \sum_{i=2}^{t} (n_i^+ - n_i^-)$ and by the linearity of expectation, we can use a Chernoff bound as in Lemma 3.3 to show that the estimation is an $\varepsilon$-approximation.

Using a union bound on the fail probability of up to time steps, we get a probability of having a correct estimation in any time step. ∎

###### Lemma 3.7.

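For a fixed value $v$ and $T' = \min\left\{\frac{1}{2\sigma}, \frac{1}{\delta}\right\}$ time steps, ContinuousEpsilonApprox uses $\Theta\left((1 + T'\sigma) \cdot \frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)$ messages on expectation.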

###### Proof.

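The message complexity depends on the initial size $n_1$ and on the number of nodes leaving and entering in those $T'$ time steps, which is bounded by $T' \sigma n_1$. If EpsilonFactorApprox obtained a correct probability $p$ in the first step of ContinuousEpsilonApprox, i.e. $p = \Theta\left(\frac{1}{n_1}\right)$, the expected number of messages (in case $p < 1$) is

$$E\left[\sum_{t=1}^{T'} \bar n_t \,\middle|\, p = \Theta\left(\tfrac{1}{n_1}\right)\right] = E\left[\bar n_1 + \sum_{i=2}^{T'} \left(\bar n_i^+ + \bar n_i^-\right) \,\middle|\, p = \Theta\left(\tfrac{1}{n_1}\right)\right] = \left(n_1 + \sum_{i=2}^{T'} \left(n_i^+ + n_i^-\right)\right) p \le \left(n_1 + T' \sigma n_1\right) p$$

$$= n_1 (1 + T'\sigma) \cdot 24 \cdot \frac{1}{\varepsilon^2 \tilde n_v^{const}} \ln \frac{1}{\delta'} = \Theta\left(\left(1 + \min\left\{\frac{1}{2\sigma}, \frac{1}{\delta}\right\} \sigma\right) \cdot \frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right).$$

Considering the case where EpsilonFactorApprox estimated wrongly, the message complexity could increase greatly if the probability $p$ is too large for the actual number of nodes (i.e. an underestimation leads to high message complexity). But the probability of misestimating by some constant factor (which would increase the message complexity by that factor) decreases exponentially in this factor (as shown in Lemma 3.4 for EpsilonFactorApprox), leaving the expected number of messages asymptotically unchanged. ∎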

###### Theorem 3.8.

There exists an (,)-approximation algorithm for the Frequency Monitoring Problem for consecutive time steps which uses an amount of messages on expectation, if .

###### Proof.

The algorithm works by first applying ConstantResponse to obtain the domain and then applying ContinuousEpsilonApprox for every value $v$ of the domain. By Lemma 3.6, in every time step and for every $v$, the frequency of $v$ is approximated up to the error $\varepsilon$ with the probability stated there. We divide the $T$ time steps into intervals of size $T'$ and perform ContinuousEpsilonApprox on each of them for every value $v$. There are $\frac{T}{T'}$ such intervals. For each of those, by Lemma 3.7 we need the number of messages stated there on expectation for each $v$. Summing over all intervals and all values yields the claimed complexity. Using a union bound over the fail probability for every $v$, a success probability of at least $1 - \delta$ follows. ∎

By Theorem 3.5, trivially repeating the single step algorithm EpsilonFactorApprox needs messages on expectation for (because the number of nodes in