Hashing Pursuit for Online Identification of Heavy-Hitters in High-Speed Network Streams



Distributed Denial of Service (DDoS) attacks have recently become more prominent, both in frequency and in magnitude. Such attacks render key Internet resources unavailable and disrupt normal Internet operation. It is therefore of paramount importance to quickly identify malicious Internet activity. The DDoS threat model includes characteristics such as: (i) heavy-hitters that transmit large volumes of traffic towards “victims”, (ii) persistent-hitters that send traffic, not necessarily large, to specific destinations to be used as attack facilitators, and (iii) host and port scanning for compiling lists of insecure servers to be used as attack amplifiers. This conglomeration of problems motivates the development of space- and time-efficient summaries of data traffic streams that can be used to identify heavy-hitters associated with the above attack vectors. This paper presents a hashing-based framework and fast algorithms that take into account the high dimensionality of the incoming network stream and can be employed to quickly identify the culprits. The proposed algorithms and data structures provide a synopsis of the network stream that is not taxing to fast memory and can be efficiently implemented in hardware, since they rely on simple bit-wise operations. The methods are evaluated using real-world Internet data from a large academic network.

I Introduction

Distributed Denial of Service attacks have recently become prominent both in frequency and in magnitude [1]. The detection and identification of such nefarious Internet activity are key problems for network engineers. The time scale at which such attacks are detected and identified is of crucial importance. In practice, this time scale, referred to here as the relevant time scale (RTS), should be on the order of seconds or minutes rather than hours. However, processing data in a streaming fashion for network monitoring that aims to detect and identify network anomalies poses two fundamental computing challenges. First, one needs to work with “small-space” data structures; storing snapshots of the incoming data stream in fast memory is prohibitively expensive. Second, any data processing on the incoming network stream ought to be performed efficiently; expensive and time-consuming computations on voluminous streams may defeat the purpose of real-time or near-real-time network monitoring.

In addition to the need for rapid RTS detection and identification of malicious activities, another important feature is their growing sophistication. Attacks in complex modern networks are often distributed and coordinated [1]. For example, the sources involved in a DDoS attack may be spread through various sub-networks and individually may not stand out as heavy traffic generators. Therefore, detection of the attack and identification of its victim(s) is only possible if traffic is monitored simultaneously over multiple sites on the network. The rapid communication (on the RTS scale) of anomaly signatures from monitoring sites to a single decision center imposes stringent size constraints on the data structures involved.

In recent years, numerous sophisticated and accurate methods for anomaly detection have been proposed. For example, signature-based methods examine traffic/data for known malicious patterns to detect malware (worms, botnets, etc.); Snort (see snort.org) and Bro (bro.org) are two well-known tools of this class. On the other hand, a plethora of methods have appeared in the literature that look for deviations from “normality” (e.g., see [2, 3, 4, 5] to name a few, and the review paper [6] for a more exhaustive list) and do not require a priori knowledge of attack patterns. Most of them, however, require heavy computation (e.g., singular value decomposition of large matrices or computation of wavelet coefficients) and/or storage and post-processing of historical traffic traces. This makes them essentially inapplicable in practice on the RTS. As a consequence, much attention has been paid to alleviating the high dimensionality constraint of the problem. Algorithms that involve memory-efficient data structures, called “sketches”, and that exhibit sub-linear computational cost with respect to the input space have been well studied by the community [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. These summary data structures can accurately estimate the state of the measured signal while having a low computation and space footprint. However, with the exception of the work in [12], where a hash-based algorithm and fast hardware implementation is introduced, there is a gap between the theoretical optimality of advanced sub-linear algorithms and their practical implementation.

This paper aims to bridge this gap and its key contributions include: (i) the development of space and time efficient algorithms and data structures that can be continuously and rapidly updated while monitoring fast traffic streams. The proposed data structures, based on permutation hashes, allow us to identify the heavy-hitters in a traffic stream (i.e., the IPs or other keys responsible for, say, the top-$k$ signal values in a window of interest). Our algorithms are memory efficient and require constant memory space that is much smaller than the dimension of the streaming data. Further, all computations involve fast bit-wise operations, making them amenable to fast hardware implementation; (ii) we propose a framework suited for different types of traffic signals in a unified “hashing pursuit” perspective. For example, the signal of interest can be conventional traffic volume (measured in packets or bytes) or the number of different source IPs that have accessed a given destination, etc. In the latter case (upon filtering out the well-known benign nodes/users/networks) the heavy hitters correspond to potential victims of DDoS attacks. Their rapid identification (on the RTS) for the purpose of mitigation is of utmost importance to network security; (iii) we evaluate our algorithms with real-world networking data collected at a large academic ISP in the United States.

II Streaming Paradigm

The theoretical framework of computation on data streams is best suited to our needs [18]. In this context, a traffic link can be viewed as a stream of items $(\omega_n, v_n)$, $n = 1, 2, \ldots$, that are seen only once as they pass through a monitoring station (e.g. traffic router). Due to space/time constraints, the entire stream cannot be recorded and hence only fast, small-memory sequential algorithms can be used for its characterization and analysis. The $\omega_n \in \Omega$ are the keys and the $v_n$ are the updates (e.g., payload increments) of the stream. For example, the set of keys $\Omega$ could be all IPv4 addresses of traffic sources and the payloads could include byte content, packets, port, protocol, or other pertinent information. Alternatively, the stream keys could be the pair of source and destination IPv4 addresses, and one could enlist many other types of keys (e.g., IPv6 addresses).

Over a monitoring period (e.g. a few seconds), the stream communicates a signal $f: \Omega \to V$, which could take either numerical or set values. Consider the simple scalar setting where $\omega_n = (\mathrm{src}_n, \mathrm{dst}_n)$ and $v_n$ is the number of bytes exchanged between source $\mathrm{src}_n$ and destination $\mathrm{dst}_n$. Upon observing the $n$-th update $(\omega_n, v_n)$, the signal is updated sequentially like a cash register:

$$f(\omega_n) \leftarrow f(\omega_n) + v_n.$$

Thus, at the end of the monitoring period, $f$ contains the number of bytes communicated for all pairs $(\mathrm{src}, \mathrm{dst})$. The “signal”, however, is only a theoretical quantity, since in practice, it cannot be fully stored and accessed on the RTS. Nevertheless, compressed representations of $f$, known as sketches, have been proposed to study its characteristics [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. Broadly speaking, sketches provide statistical summaries of the signal, which can be used to approximate (with high probability) features of interest. (See also the related compressed sensing paradigm [19, 20, 21, 22].)

The cash register model described above is well-suited for applications such as monitoring IP traffic entering a network link, monitoring IPs accessing a particular service (such as a Web server, cloud storage, etc.), monitoring particular IP pairs of a particular application (e.g., source-destination pairs of all DNS traffic), and many others.

A variation of the cash register model is obtained when the payloads $v_n$ are sequential updates on a set. In this model, upon observing $(\omega_n, v_n)$, we update the state of the signal as follows: $f(\omega_n) \leftarrow f(\omega_n) \cup \{v_n\}$. For example, if $\omega_n$ is the source IP and $v_n$ is the port number ($v_n \in [2^{16}]$) (we define the compact notation $[m] := \{0, 1, \ldots, m-1\}$ to be used hereafter), then the signal $f: \Omega \to \mathcal{P}([2^{16}])$, where $\mathcal{P}([2^{16}])$ is the power-set of $[2^{16}]$, is set-valued. For a given IP $\omega$, the set $f(\omega)$ consists of all different ports accessed by $\omega$ during the monitoring period, and large values of $|f(\omega)|$ identify potential port scanning sources $\omega$. Similarly, by considering destination IPs in place of ports we can identify hosts that perform malicious activity such as horizontal host scanning.
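The two update models above can be illustrated with a minimal Python sketch. It keeps the full signal in a dictionary, which is exactly what cannot be afforded at line rate, so it only serves as a reference for what the sketches later approximate. The function names and the toy streams are hypothetical and not part of the paper.

```python
from collections import defaultdict

def cash_register(stream):
    """Scalar signal: f[key] accumulates the payloads seen for that key."""
    f = defaultdict(int)
    for key, payload in stream:
        f[key] += payload
    return f

def set_register(stream):
    """Set-valued signal: f[key] collects the distinct payloads seen for that key."""
    f = defaultdict(set)
    for key, payload in stream:
        f[key].add(payload)
    return f

# Hypothetical toy stream of (source IP, bytes) tuples.
byte_stream = [("10.0.0.1", 500), ("10.0.0.2", 40), ("10.0.0.1", 1500)]
bytes_per_source = cash_register(byte_stream)

# Hypothetical toy stream of (source IP, destination port) tuples.
port_stream = [("10.0.0.9", 22), ("10.0.0.9", 23), ("10.0.0.9", 22), ("10.0.0.1", 80)]
ports_per_source = set_register(port_stream)

# Heavy hitters under two different magnitude functionals.
heaviest = max(bytes_per_source, key=bytes_per_source.get)
widest = max(ports_per_source, key=lambda k: len(ports_per_source[k]))
```

Here `heaviest` is the key with the largest byte count and `widest` is the key with the largest set of distinct ports, matching the two notions of heavy hitter used in this section.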

In the above two contexts, the goal is to identify heavy–hitters, i.e. source–destination pairs generating heavy traffic, or a source with a large set of accessed ports or destinations, for example. For a signal $f$, a heavy–hitter is formally a key

$$\omega^* = \arg\max_{\omega \in \Omega} \mathcal{L}(f(\omega)) \qquad (1)$$

maximizing a loss function $\mathcal{L}$, where $\mathcal{L}(f(\omega))$ could be a simple byte-count, size of a set, or in a more complex, time–series setting, the frequency maximizing the power spectrum, for example.

Our goal is to develop a unified framework for the identification of heavy hitters, which can be applied to a variety of signals and which works on the relevant time scale for anomaly detection (i.e., the algorithms monitor the stream in real time and produce heavy–hitters every few seconds, minutes or every 100,000 tuples seen, etc.). Depending on the type of signal, the identification part is bundled with a different specialized sketch, which allows us to handle ultra high dimensional signals. The following section describes the general methodology in terms of meta-algorithms and then proceeds to several concrete applications.

III Hashing Pursuit for Identification

At the heart of our identification schemes lies hashing. Reversible hash functions are used to compress the domain of our incoming signal into a smaller dimension, while at the same time uniformly distributing the keys onto the reduced-dimension space. Another application of hashing used below is to efficiently ‘thin’ the original stream into sub-streams to help increase identification accuracy. We apply our hash functions on a set of keys $\Omega$, an important special case being the set of all IPv4 numbers, where $|\Omega| = 2^{32}$.

Hash functions for IP monitoring. Consider a set of hash functions $h_j: \Omega \to \{0, 1, \ldots, m-1\}$, $j = 1, \ldots, L$. This set is complete (or perfect), if all $\omega \in \Omega$ are uniquely identified by their hashes $h_j(\omega)$, $j = 1, \ldots, L$, i.e., the vector–hash $h(\omega) := (h_1(\omega), \ldots, h_L(\omega))$ is invertible. We shall describe below a particular family of rapidly computable complete hashes, which can also be quickly inverted. In particular, we will have that $|\Omega| = m^L$.

We describe next the main idea behind identification over the simple case of a scalar signal $f(\omega)$, which tracks the number of packets transmitted by IP $\omega$ over a monitoring period. The goal is to be able to identify the most persistent user $\omega^*$ with the highest packet count.

Instead of maintaining in memory the entire signal $f$ using an array of size $|\Omega|$, we shall maintain $L$ hash–histograms $H_j$, $j = 1, \ldots, L$, of size $m$ each. If $\omega^*$ is the heavy hitter, then it will contribute large packet counts to all of the histograms at the bins $h_j(\omega^*)$. If collisions are uniformly spread–out, with high probability, we will have that

$$h_j(\omega^*) = \arg\max_{0 \le i < m} H_j(i), \quad j = 1, \ldots, L.$$

Thus, by locating the bins maximizing each of the hash–histograms, we define

$$\hat\omega := h^{-1}(i_1^*, \ldots, i_L^*), \quad \text{where } i_j^* := \arg\max_{0 \le i < m} H_j(i).$$

We provide lower bounds on $P(\hat\omega = \omega^*)$ and show that in practice the heavy hitter is identified w.h.p. (with high probability). In fact, the method naturally extends to the case of multiple heavy hitters and other types of signals. Meta-algorithms i and ii describe the general encoding and decoding steps.

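0:  Set of complete uniform hashes $h_j: \Omega \to \{0, \ldots, m-1\}$, $j = 1, \ldots, L$,
with $m^L = |\Omega|$. Initialize signal hash histograms: $H_j(i) = 0$ (or $\emptyset$), $0 \le i < m$, $j = 1, \ldots, L$.
1:  [Start] Begin stream monitoring.
2:  [Hash keys] Upon observing $(\omega_n, v_n)$, compute $i_j = h_j(\omega_n)$, $j = 1, \ldots, L$.
3:  [Update] Compute $H_j(i_j)$ = Update($H_j(i_j)$, $v_n$), $j = 1, \ldots, L$.
4:  [Stop] End when the monitoring period is over.
5:  return  Output hash arrays $H_j$, $j = 1, \ldots, L$, for analysis.
Meta-Algorithm i Hash-based Encoding

0:  Set of hash histogram arrays $H_j$, $j = 1, \ldots, L$, as outputs of Meta-algorithm i. An efficient function $\mathcal{L}$ as in (1).
1:  [Identify] For each hash histogram array $H_j$ identify a bin index $i_j^* = \arg\max_i \mathcal{L}(H_j(i))$ as a candidate bin where the heavy hitter falls.
2:  return  $\hat\omega = h^{-1}(i_1^*, \ldots, i_L^*)$.
Meta-Algorithm ii Decode and Identify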

The specific implementation of the hash functions and their inverses will be discussed in the following section, where the abstract ‘update’ and ‘identify’ steps in the above meta–algorithms will depend on the application.

Remark 1.

Let $N = |\Omega|$. The space required to maintain the raw signal $f$ is $\mathcal{O}(N)$, while the space required for the $L$ hash histograms of $m$ bins each is $\mathcal{O}(mL) = \mathcal{O}(m \log_m N)$. These exponential savings in memory allow us in practice to rapidly communicate hash–based traffic summaries at the RTS, enabling simultaneous and distributed monitoring and identification (see Table III).

Remark 2.

The update step in Algorithm i depends on the type of signal. In the scalar case, we have

$$H_j(i_j) \leftarrow H_j(i_j) + v_n,$$

while in the set-valued case $H_j(i_j) \leftarrow H_j(i_j) \cup \{v_n\}$. In applications, maintaining a set, however, is often not practical either. When we are only interested in the size of the set, we can maintain a further max–stable sketch data structure, requiring a small array of fixed size for each bin $H_j(i)$ (see Section IV-D).

Construction of complete reversible hashes. We focus on the case where the keys represent IPv4 numbers, i.e. $\Omega = \{0, 1, \ldots, 2^{32}-1\}$. A simple set of complete hash functions can be constructed as follows. Let $m = 2^8$ and $L = 4$ and consider the base-$m$ expansion of $\omega \in \Omega$:

$$\omega = \sum_{j=1}^{L} \omega(j)\, m^{L-j}, \quad \omega(j) \in \{0, 1, \ldots, m-1\}.$$

The $\omega(j)$’s are simply the 4 octets in the dotted-quad representation of the IP address $\omega$.

Clearly, the hashes

$$h_j(\omega) := \omega(j), \quad j = 1, \ldots, L, \qquad (2)$$

are rapidly computable with bit–wise operations, e.g. $h_2(\omega) = (\omega \gg 16)\ \&\ 255$. They are also complete since the knowledge of all base-$m$ digits determines $\omega$. In practice, however, these naive hash functions are not useful. The reason is that the IPs are typically clustered (spatially correlated) since the address space used by a sub–network or autonomous system consists of contiguous ranges of IPs. Hence, one needs to permute or ‘mangle’ the IPs in the original space so that all permuted keys achieve approximately uniform empirical distributions. This can be readily resolved by composing the hashes with a permutation $\sigma: \Omega \to \Omega$ that can also be efficiently computed and inverted. The idea of IP mangling was also employed in [12]. Next, we describe a different kind of permutation function, starting with a simple general result.
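As an illustration of the naive octet hashes and of the clustering problem just described, the following Python sketch extracts the four octets of a 32-bit IPv4 integer with shifts and masks, and inverts them. The function names are hypothetical; the most-significant-first ordering is an assumption made here to match the dotted-quad notation.

```python
def octet_hashes(ip):
    """Return the four octets of a 32-bit IPv4 integer, most significant first."""
    return [(ip >> shift) & 0xFF for shift in (24, 16, 8, 0)]

def from_octets(octets):
    """Invert octet_hashes: rebuild the 32-bit integer from its four octets."""
    ip = 0
    for o in octets:
        ip = (ip << 8) | o
    return ip

ip = (192 << 24) | (168 << 16) | (1 << 8) | 77     # 192.168.1.77
digits = octet_hashes(ip)

# A contiguous /24 block shares its first three octets, so without a
# permutation every address of the block hits the same three bins.
base = (192 << 24) | (168 << 16) | (1 << 8)
shared_prefixes = {tuple(octet_hashes(base + i)[:3]) for i in range(256)}
```

The single element in `shared_prefixes` shows why contiguous address ranges produce heavily non-uniform histograms unless the keys are permuted first.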

Fact 1.

Equip the set of keys $\Omega$ with the uniform probability measure. Then:

(i) The hashes $h_j$, $j = 1, \ldots, L$, in (2), viewed as random variables on $\Omega$, are independent and uniformly distributed.

(ii) For any permutation $\sigma: \Omega \to \Omega$, the hashes $h_j \circ \sigma$, $j = 1, \ldots, L$, are also independent and uniformly distributed.

Conversely, if $\tilde h_j$, $j = 1, \ldots, L$, is a set of uniformly distributed independent hashes, then they are complete and $\tilde h_j = h_j \circ \sigma$, $j = 1, \ldots, L$, for some permutation $\sigma$.

The proof is elementary and omitted in the interest of space. This fact suggests that so long as a permutation $\sigma$ spreads out a set of IPs essentially uniformly, the empirical distribution of the hashes will be approximately uniform (see Fig. 1). Therefore, to obtain a high–quality set of complete hashes, it suffices to find an efficiently computable and invertible permutation $\sigma$. We do so next.

Fact 2.

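Let $N \in \mathbb{N}$ and suppose that $ab \equiv 1 \pmod N$, for some integers $a, b$. Define $\sigma_a, \sigma_b: \{0, \ldots, N-1\} \to \{0, \ldots, N-1\}$, as follows

$$\sigma_a(\omega) := a\,\omega \bmod N, \qquad \sigma_b(\omega) := b\,\omega \bmod N.$$

Then, $\sigma_a$ and $\sigma_b$ are bijections and $\sigma_a^{-1} = \sigma_b$.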

The proof is straightforward.

Remark 3.

Similar but more computationally intensive IP mangling methods (with efficient hardware implementation on an FPGA) were given in [12]. Shai Halevi [23] proposed alternative permutations that can be efficiently computed and inverted and exhibit good cryptographic properties. Interesting connections with the general theory of homomorphic coding can be further pursued.

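In practice, with $N = 2^{32}$, we consider

$$\sigma(\omega) := a\,\omega \bmod N, \qquad \sigma^{-1}(\omega) = b\,\omega \bmod N. \qquad (3)$$

We found that with works well. Observe also that if $ab \equiv 1 \pmod N$, then $a^k b^k \equiv 1 \pmod N$, for all integers $k$. Thus, via powers of $a$ and $b$ in (3), one can easily generate different permutations. Our experiments with naturally occurring sets of IPs show powers or work slightly better than .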
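A minimal Python sketch of an invertible multiplicative permutation modulo $2^{32}$ is given below. It assumes the permutation has the form of multiplication by an odd constant, and the multiplier used here (Knuth's multiplicative-hashing constant) is a hypothetical stand-in chosen for illustration, not the value used in the paper. The three-argument `pow` with exponent `-1` computes the modular inverse and requires Python 3.8 or later.

```python
N = 2 ** 32
A = 2654435761          # hypothetical odd multiplier (illustrative, not from the paper)
B = pow(A, -1, N)       # modular inverse of A; exists because A is odd

def permute(ip):
    """Mangle a 32-bit key by multiplication modulo 2**32."""
    return (A * ip) % N

def unpermute(x):
    """Invert permute by multiplying with the inverse multiplier."""
    return (B * x) % N

base = (10 << 24) | (1 << 16) | (2 << 8)          # 10.1.2.0/24, a contiguous block
block = [base + i for i in range(256)]
first_octets_before = {ip >> 24 for ip in block}
first_octets_after = {permute(ip) >> 24 for ip in block}
```

Before the permutation, the whole block falls into a single first-octet bin; afterwards its addresses are spread over many bins, while `unpermute` still recovers every original address exactly.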

IV Identification Algorithms

Having all building blocks in place, we now introduce our algorithms. Table I provides a roadmap.

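Algorithm            Sketch size           Application
Simple (IV-A)        $\mathcal{O}(Lm)$     Scalar signals, top-1 hitter, suited for stringent memory constraints
Max-Count (IV-B)     $\mathcal{O}(LmS)$    Scalar signals, top-$k$ hitters
Boyer-Moore (IV-C)   $\mathcal{O}(Sm)$     Scalar signals, top-$k$ hitters
Max-Sketch (IV-D)    $\mathcal{O}(LmK)$    Set-valued signals, top hitter
TABLE I: Algorithms Roadmap ($L$: number of hashes, $m$: bins per hash, $S$: number of sub-streams, $K$: max-sketch size)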

IV-A Simple Hashing Pursuit

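0:  Number of heavy-hitters: $k$
1:  [Start] Initialize array $H$ of size $L \times m$ to zeros.
2:  [Permute] Upon observing $(\omega_n, v_n)$, compute $\tilde\omega_n = \sigma(\omega_n)$.
3:  [Hash] Compute $i_j = h_j(\tilde\omega_n)$, $j = 1, \ldots, L$. # NOTE: $i_j \in \{0, \ldots, m-1\}$
4:  [Update] Update for $j = 1, \ldots, L$: $H[j, i_j] \leftarrow H[j, i_j] + v_n$
5:  if (Populated for some time window) then
6:     if k==1 then
7:        Find $i_j^* = \arg\max_i H[j, i]$,  $j = 1, \ldots, L$
8:        [decode] Output $\hat\omega = \sigma^{-1}(h^{-1}(i_1^*, \ldots, i_L^*))$.
9:     else
10:        Initialize $H^{(1)} = H$
11:        for r=1 to $k$ do
12:           Find $i_j^{(r)} = \arg\max_i H^{(r)}[j, i]$,  $j = 1, \ldots, L$
13:           Set $H^{(r+1)} = H^{(r)}$ with $H^{(r+1)}[j, i_j^{(r)}] = 0$, $j = 1, \ldots, L$ #Exclude for next iteration
14:           [decode] Output $\hat\omega_r = \sigma^{-1}(h^{-1}(i_1^{(r)}, \ldots, i_L^{(r)}))$.
15:        end for
16:     end if
17:  end if
Algorithm 1 Simple Hashing Pursuit (SHP)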

Algorithm 1 is the simplest instance of IP randomization and hashing used to identify the heavy hitter for a scalar signal. Fig. 1 shows the resulting signal arrays $H_j$, also referred to as octet arrays. If $i_1$ is the first octet of the (encoded) IP address $\sigma(\omega)$, then its payload $v$ is added to the bucket $H_1(i_1)$. If the permutation $\sigma$ spreads out the range of IPs in the observed stream approximately uniformly, it is reasonable to expect that the resulting vector $H_1$ will be populated uniformly. A heavy hitter, however, will contribute an abnormally large signal value to the bucket $H_1(i_1^*)$, where $i_1^*$ is the first octet of its encoded key. The heavy hitter similarly stands out in its encoded octet indexes $i_2^*$, $i_3^*$ and $i_4^*$. By applying the inverse of the permutation $\sigma$ to the largest bins in the octet arrays, we recover the address of the anomalous user (see also Meta-Alg. ii).

The fact that the permutation can be efficiently inverted is of utmost importance in practice since this algorithm is intended to be run online on fast network traffic streams. This is, in fact, feasible on the RTS as demonstrated in [12] with an efficient FPGA implementation of a similar and slightly more computationally demanding IP randomization.

Fig. 1: Algorithm 1’s data structure consists of $L = 4$ hash arrays of size $m = 256$ (here separated by the thick-dashed lines), one for each octet. Circles illustrate the selected maxima of each octet.
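The top-1 case of Simple Hashing Pursuit can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the permutation is multiplication by a hypothetical odd constant modulo $2^{32}$, the hashes are the four octets of the permuted key, and the class and method names are made up for this example.

```python
N, L, M = 2 ** 32, 4, 256
A = 2654435761                 # hypothetical odd multiplier, not from the paper
B = pow(A, -1, N)              # its inverse modulo 2**32 (Python 3.8+)

def octets(x):
    """Four octets of a 32-bit integer, most significant first."""
    return [(x >> s) & 0xFF for s in (24, 16, 8, 0)]

class SimpleHashingPursuit:
    def __init__(self):
        self.H = [[0] * M for _ in range(L)]      # one histogram per octet

    def update(self, ip, value):
        """Permute the key, then add the payload to one bin per octet array."""
        for j, i in enumerate(octets((A * ip) % N)):
            self.H[j][i] += value

    def top1(self):
        """Decode the key whose octets are the maximizing bins."""
        x = 0
        for row in self.H:
            x = (x << 8) | row.index(max(row))    # maximizing bin of each octet array
        return (B * x) % N                        # undo the permutation

shp = SimpleHashingPursuit()
heavy = (192 << 24) | (168 << 16) | (1 << 8) | 77
shp.update(heavy, 50_000)
for i in range(2000):                             # light background traffic
    shp.update((10 << 24) + i, 20)
recovered = shp.top1()
```

In this toy stream the background traffic sums to less than the heavy hitter's volume, so the heavy hitter's bin is the maximum of every octet array and `recovered` equals `heavy`. With realistic traffic, recovery holds only with high probability, as discussed in Section V.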

IV-B Hashing Pursuit on Sub-streams

Algorithm 2, called Max-Count Hashing Pursuit, is an extension of SHP that reduces collisions significantly, and hence improves identification accuracy. The idea is to introduce an additional dimension in our data structure and populate it by using another hash function $g$. The secondary hash is used to divide the incoming stream into $S$ independent sub-streams (i.e., ‘thin’ the stream).

As stream entries arrive, we initially perform the permutation step discussed above to remove spatial localities. Then, having a uniform hash function $g$ as discussed in Section III, we apply it on the permuted key to get the index $s$ of the sub-stream. We update only the array entries corresponding to sub-stream $s$.

0:  Number of heavy-hitters: $k$
1:  [Start] Initialize of size .
2:  Perform steps 2 & 3 of Algorithm 1
3:  [Thin] Calculate sub-stream index .
4:  [Update] Update for :
5:  if (Populated for some time window) then
6:     if k==1 then
7:        Construct matrix of size , i.e.,
8:        ,  ,
9:        Find ,  
10:        [decode] Output .
11:     else
12:        for r=1 to $k$ do
13:           ,  ,
14:           Find ,  
15:           Find ,  
16:            = 0,   #Exclude for next iteration
17:           [decode] Output .
18:        end for
19:     end if
20:  end if
Algorithm 2 Max-Count Hashing Pursuit
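The stream-thinning idea can be sketched in Python as follows. Several choices here are assumptions made for illustration and are not taken from the paper: the multiplier of the permutation, the use of `zlib.crc32` as the secondary hash, the value of `S`, and the decoding rule for the top hitter (pick the sub-stream that holds the largest bin, then read off the maximizing bin of each of its octet arrays).

```python
import zlib

N, L, M, S = 2 ** 32, 4, 256, 50
A = 2654435761            # hypothetical odd multiplier, not from the paper
B = pow(A, -1, N)         # inverse multiplier (Python 3.8+)

def octets(x):
    return [(x >> s) & 0xFF for s in (24, 16, 8, 0)]

def substream(x):
    # Stand-in secondary hash: CRC32 of the permuted key, reduced modulo S.
    return zlib.crc32(x.to_bytes(4, "big")) % S

H = [[[0] * M for _ in range(L)] for _ in range(S)]   # S x L x M counters

def update(ip, value):
    x = (A * ip) % N
    rows = H[substream(x)]                # only this sub-stream's arrays are touched
    for j, i in enumerate(octets(x)):
        rows[j][i] += value

def top1():
    # Assumed decoding rule: take the sub-stream holding the largest bin,
    # then read off the maximizing bin of each of its octet arrays.
    rows = max(H, key=lambda r: max(max(row) for row in r))
    x = 0
    for row in rows:
        x = (x << 8) | row.index(max(row))
    return (B * x) % N

heavy = (192 << 24) | (168 << 16) | (1 << 8) | 1
update(heavy, 50_000)
for i in range(2000):                     # light background traffic
    update((i * 7919 + 13) % N, 20)
recovered = top1()
```

Because each key lands in exactly one sub-stream, background keys collide with the heavy hitter only if they share both its sub-stream and an octet bin, which is the mechanism by which thinning reduces collisions.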

IV-C Hash-thinned MJRTY Boyer-Moore

Algorithm 3 is based on the Boyer-Moore majority vote algorithm [24] (not to be confused with the Boyer-Moore algorithm for string matching used in signature-based detection tools like Snort), and on the idea of stream thinning for creating sub-streams described above. The MJRTY Boyer-Moore algorithm identifies exactly the element of a stream with the largest traffic, provided that its volume exceeds 50% of the total volume (i.e., there is a majority). It solves the problem in time linear in the length of the input sequence and in constant memory. In reality, though, a single IP or flow is not usually responsible for such a high fraction of the total volume of the stream. Hence, the identification accuracy of the plain Boyer-Moore algorithm could be lacking. To overcome this, we employ the Boyer-Moore idea on sub-streams of the stream, constructed as described above. Thus, with $S$ sub-streams of roughly equal volume, the ‘majority-threshold’ for each sub-stream becomes roughly $50/S$% of the total stream volume. In practice, as Fig. 3 illustrates, this makes the Boyer-Moore-based algorithm perform remarkably well, but one could use a larger number of sub-streams to further thin the stream, if needed.

We describe the original Boyer-Moore algorithm with an analogy to the one-dimensional random walk on the line of non-negative integers. A variable count is initialized to 0 (i.e., the origin) and a candidate variable cand is reserved for use. Once a new IP arrives, we check to see if count is 0. If it is, that IP is set to be the new candidate cand and we move count one step up, i.e. count = 1. Otherwise, if the IP is the same as cand, then cand remains unchanged and count is incremented, and, if not, count moves one step down (is decremented). We then proceed to the next IP and repeat the procedure. Provably, when all IPs are read, cand will hold the one with majority, if a majority exists.
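The random-walk description above corresponds to the classical majority vote procedure, which can be written as a short Python function. The function name is chosen here for illustration. Note that the returned candidate is guaranteed to be the majority element only if a majority exists; a second pass over the data would be needed to verify that.

```python
def mjrty(stream):
    """Boyer-Moore majority vote: one pass, constant memory."""
    cand, count = None, 0
    for item in stream:
        if count == 0:
            cand, count = item, 1      # adopt a new candidate, step up to 1
        elif item == cand:
            count += 1                 # same as candidate: step up
        else:
            count -= 1                 # different item: step down
    return cand

winner = mjrty(["a", "b", "a", "c", "a"])   # "a" occurs 3 times out of 5
```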

We implemented a natural extension of the MJRTY Boyer-Moore algorithm to the case of byte and packet counts and applied it in parallel on the sub-streams resulting from hash-thinning. For each sub-stream, we also maintained an additional small and constant size data structure, used to estimate the signal volume.
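One natural way to extend the majority vote to weighted updates (bytes or packets) and to run it on hash-thinned sub-streams is sketched below. This is an illustrative variant written under assumptions, not the exact procedure of Algorithm 3: the weighted rule behaves as if each update were repeated `weight` times, the secondary hash is `zlib.crc32`, the number of sub-streams is arbitrary, and the additional volume-estimation array is omitted.

```python
import zlib

S = 64   # hypothetical number of sub-streams

class ThinnedWeightedMajority:
    def __init__(self, n_sub=S):
        self.n_sub = n_sub
        self.cand = [None] * n_sub     # current candidate per sub-stream
        self.count = [0] * n_sub       # weighted vote balance per sub-stream

    def update(self, key, weight):
        s = zlib.crc32(key.to_bytes(4, "big")) % self.n_sub   # stand-in secondary hash
        if self.cand[s] == key:
            self.count[s] += weight
        elif self.count[s] >= weight:
            self.count[s] -= weight
        else:
            # Incoming weight outvotes the current candidate: swap and keep the surplus.
            self.cand[s], self.count[s] = key, weight - self.count[s]

    def candidates(self):
        """Candidates of all sub-streams, sorted by decreasing vote balance."""
        return sorted(
            ((c, k) for k, c in zip(self.cand, self.count) if k is not None),
            reverse=True,
        )

sketch = ThinnedWeightedMajority()
for key, weight in [(1, 10), (42, 30), (2, 5), (42, 40), (3, 20)]:
    sketch.update(key, weight)
top_key = sketch.candidates()[0][1]
```

In this toy stream key 42 carries 70 of the 105 units of volume, so it holds a weighted majority in whichever sub-stream it falls into and ends up as the top-ranked candidate.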

0:  Number heavy-hitters: ,
1:  [Start] Initialize ,  , .
2:  Initialize , ,  
3:  Upon observing , compute .
4:  [Thin] Calculate sub-stream index .
5:  [Hash] Compute ,
6:  [Update]
7:  if  then
8:     [s] = , [s] = , [s,j] =
9:  else
10:     if  then
11:        [s,j] = [s,j] + , [s] = [s] +
12:     else
13:        if  then
14:            = -
15:           if  then
16:              [s] = , [s] = -[s] # reset candidate
17:           end if
18:           [s,j] = [s,j] +
19:        else
20:           [s] = , [s] = # reset candidate
21:           [s,j] = [s,j] +
22:        end if
23:     end if
24:  end if
25:  if (Populated for some time window) then
26:     ,  
27:     Initialize
28:     for r=1 to $k$ do
29:        Find
30:         #Exclude for next iteration
31:        Output cand
32:     end for
33:  end if
Algorithm 3 Hash-thinned MJRTY Boyer-Moore

IV-D Applications to more complex signals

In this section we extend the randomized domain hashing approach to the detection of more complex anomalies. As discussed in Section II, one common scenario could involve a “persistent” user that accesses an abnormally large number of ports. Note that the overall volume or frequency of packets from this user need not be abnormal. Therefore, such anomalous activity may be nearly impossible to detect with algorithms that merely detect large-volume hitters. To address this problem with constant and small-size memory, we propose next another algorithm, which combines randomized hashing and max-stable sketches [25].

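These sketches exploit the max-stability property (4) of the Fréchet distribution. Recall that a random variable $Z$ is $\alpha$-Fréchet if $P(Z \le x) = \exp\{-\sigma^{\alpha} x^{-\alpha}\}$, for $x > 0$, and 0, otherwise, for some scale $\sigma > 0$. If $\sigma = 1$, then $Z$ is called standard $\alpha$-Fréchet. Let $Z_1, \ldots, Z_n$ be independent standard 1-Fréchet random variables. Then, it is known [25] that

$$M_n := \max_{1 \le i \le n} Z_i \overset{d}{=} n\, Z_1, \qquad (4)$$

where ‘$\overset{d}{=}$’ denotes equality in distribution. Thus, $M_n$ is a 1-Fréchet variable with scale coefficient $n$. One can easily express the median $\mathrm{med}(Z)$ of a 1-Fréchet variable $Z$ with scale coefficient $\sigma$ by considering that $P(Z \le \mathrm{med}(Z)) = 1/2$. By the definition of the 1-Fréchet distribution and some algebra, one obtains $\mathrm{med}(Z) = \sigma / \ln 2$.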
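The algebra behind the median can be spelled out as follows, writing $\sigma$ for the scale coefficient and $x_{1/2}$ for the median of a 1-Fréchet variable $Z$ (the symbol $x_{1/2}$ is introduced here for illustration):

```latex
\begin{align*}
P(Z \le x_{1/2}) = \exp\{-\sigma / x_{1/2}\} &= \tfrac{1}{2} \\
\Longleftrightarrow \quad \sigma / x_{1/2} &= \ln 2 \\
\Longleftrightarrow \quad x_{1/2} &= \sigma / \ln 2 .
\end{align*}
```

Consequently, the scale coefficient can be recovered from the median as $\sigma = x_{1/2} \ln 2$, which is what turns a sample median of maxima into an estimate of the number of distinct elements.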

The theory above can be employed to estimate the number of elements of a set. Specifically, consider the problem of finding the number of unique destination ports contacted by a given host. When a stream element arrives, we generate a 1-Fréchet variable with a seed that is a function of the port in that element. Thus, if the same port arrives again, exactly the same pseudo-random number is generated. As different ports arrive, we sequentially update our sketch by taking the maximum of the new 1-Fréchet variable and the sketch entry. Essentially, we are building one realization of the random variable described above, i.e., of the maximum of independent standard 1-Fréchet variables. The number of different terms in this maximum corresponds to the number of different ports.

Since we want to take the median of this maximum to estimate its scale coefficient (i.e., the number of distinct ports), we independently keep several realizations of the above procedure (see Alg. 4).
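The cardinality estimate described above can be sketched in Python for a single counter, i.e., without the hashing of keys into bins that Algorithm 4 adds on top. The names, the sketch size `K`, and the use of the element itself as the seed of Python's `random.Random` are assumptions made for illustration. Standard 1-Fréchet variables are drawn by inverting the distribution function, $Z = -1/\ln U$ with $U$ uniform on $(0, 1)$.

```python
import math
import random
import statistics

K = 256   # hypothetical number of independent realizations kept in the sketch

def frechet_draws(element, k=K):
    """K standard 1-Frechet draws; the same element always yields the same draws."""
    rng = random.Random(element)
    return [-1.0 / math.log(max(rng.random(), 1e-300)) for _ in range(k)]

def update(sketch, element):
    """Coordinate-wise maximum of the sketch and the element's draws."""
    for i, z in enumerate(frechet_draws(element, len(sketch))):
        if z > sketch[i]:
            sketch[i] = z

def estimate_cardinality(sketch):
    """The median of a 1-Frechet variable with scale n is n / ln 2."""
    return math.log(2) * statistics.median(sketch)

sketch = [0.0] * K
for port in list(range(1000)) + list(range(500)):   # 1000 distinct ports, 500 repeats
    update(sketch, port)
estimate = estimate_cardinality(sketch)
```

Repeated ports leave the sketch unchanged because they regenerate the same draws, so `estimate` approximates the number of distinct ports (1000 in this toy stream) rather than the number of updates. A larger `K` reduces the variability of the median at the cost of memory.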

0:  Size of max-sketch
1:  [Start] Initialize array of size ,
2:  [Permute] Upon observing , compute
3:  [Hash] Compute .
4:  [RNG] With random seed , and , generate independent 1–Fréchet random variables .
5:  [Update] Update for , :
6:  if (Populated for some time window) then
7:     Construct matrix of size , i.e.,
8:     ,   and
9:     Find ,  
10:     [Decode] Output .
11:  end if
Algorithm 4 Max-Stable Hashing Pursuit (MSHP)

V On the accuracy of identification

The proposed algorithms (except the hash-thinned MJRTY Boyer-Moore) are closely related to the general family of sub-linear algorithms in the theoretical work of [16]. Performance guarantees for our algorithms can thus be established in a similar way as for the so-called Euclidean approximate sparse recovery systems of [17].

To understand better the role of collisions, however, we adopt a different approach and provide performance guarantees under the following simplifying assumptions. All proofs are available in the Appendix.

Assumption 1. (exact $n$-sparsity) The signal $f$ has precisely $n$ non-zero (non-empty) entries at keys $\omega_1, \ldots, \omega_n$.

Assumption 2. (separation) The magnitude functionals of the heavy-hitters are all different: Define to be the cumulative magnitude of the bottom heavy hitters. By convention set

Exact recovery guarantees. Clearly, under Assumption 2, all our algorithms will identify the top-$k$ heavy hitters exactly, provided that there are no hash-collisions. Let $P_k$ denote the probability that the top-$k$ heavy hitters are correctly identified for an $n$-sparse signal $f$. The following results provide lower bounds on $P_k$ under various conditions.

Theorem 1.

Let and suppose that . Then, for the Simple Hashing Pursuit algorithm:

Note: With $|\Omega| = 2^{32}$ and the choice of hash functions as in Section III, we have $L = 4$ and $m = 256$.

Remark 4 (the birthday problem).

For and , the first bound in (1) can be interpreted as the probability that a class of students have all different birthdays, if the year has days. It is well-known that this probability decays rather quickly as grows.

The max-count algorithm addresses this curse of the birthday problem by effectively increasing the value .
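The classical birthday probability invoked in Remark 4 is easy to evaluate numerically. The following Python sketch computes the probability that $n$ keys placed uniformly and independently into $m$ bins all land in different bins. It illustrates the birthday effect only and is not the bound of Theorem 1; the function name is chosen here for illustration.

```python
def prob_all_distinct(n, m):
    """Probability that n uniform, independent draws from m bins are all different."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m
    return p

birthday_23 = prob_all_distinct(23, 365)      # the textbook birthday example

# How quickly the no-collision probability decays for a single 256-bin hash array.
decay = {n: prob_all_distinct(n, 256) for n in (1, 5, 10, 20, 40)}
```

The values in `decay` decrease monotonically in the number of keys, which illustrates why increasing the effective number of bins, as stream thinning does, improves exact recovery.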

Corollary 1.

Under the assumptions of Theorem 1, for the Max-count Hashing Algorithm, the bounds in (1) apply with replaced by .

Since the Boyer–Moore algorithm uses one hash function (as opposed to ), we obtain the following.

Corollary 2.

Under the assumptions of Theorem 1, for the hash-thinned MJRTY Boyer-Moore Algorithm, the bounds in (1) apply with .

Bounds on the rate of identification. The exact identification of the top- hitters is difficult as the above performance bounds indicate since a few collisions may lead to misspecification. Even in the presence of collisions, however, a relatively large proportion of the hitters is identified in practice (see e.g. Fig. 3 below). This suggests examining the rate of identification quantity:


where denotes the number of correctly identified heavy hitters among the top . That is, gives the average proportion of identified top hitters.

The combinatorial analysis needed to establish bounds on for the Simple Hashing and Max-count algorithms is rather involved and is deferred to a follow-up paper. Under certain conditions, however, an appealing closed-form expression for is available for the hash-thinned MJRTY Boyer-Moore algorithm. The conditions may be relaxed with the help of technical probabilistic analysis, which merits a separate investigation.

Theorem 2.

Suppose that for all . Then, under Assumptions 1 & 2, for the hash-thinned MJRTY Boyer-Moore Algorithm, we have the exact expression

Fig. 2: Performance bounds given the Assumptions of Section V.

VI Performance Evaluation

Next, we evaluate our algorithms with Netflow traffic traces provided by the academic ISP Merit Network. The traces are collected at a large edge router. In the hour-long trace we investigate, our Netflow stream has unique source addresses, unique destinations, and consists of million flow records. The total volume is gigabytes and million packets. We demonstrate results using Python software implementations of our algorithms.

Accuracy performance analysis. Fig. 3 shows the identification accuracy of Algorithms 1–3, compared to the ground truth (obtained by counting using a dynamic hash array, and then sorting). We search for heavy hitters both in byte volume and in frequency of occurrence. We use two metrics to evaluate our methods: (i) identification rate and (ii) exact recovery. The first gives the expected proportion of identified heavy hitters among the top-$k$, while the second one is much stricter and measures the probability that all top-$k$ hitters are correctly recovered.
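The two evaluation metrics can be made concrete with a short Python sketch. The function names and the toy windows are hypothetical, and the sketch assumes that the order in which hitters are reported is ignored, which the text does not state explicitly.

```python
def identification_rate(true_top, reported):
    """Proportion of the true top-k hitters that appear among the reported ones."""
    return len(set(true_top) & set(reported)) / len(true_top)

def exact_recovery(true_top, reported):
    """True only if all top-k hitters are recovered (order ignored)."""
    return set(true_top) == set(reported)

# Hypothetical results for two monitoring windows: (ground truth, reported).
windows = [
    (["a", "b", "c"], ["a", "b", "c"]),
    (["a", "b", "c"], ["a", "c", "x"]),
]
avg_rate = sum(identification_rate(t, r) for t, r in windows) / len(windows)
exact_frac = sum(exact_recovery(t, r) for t, r in windows) / len(windows)
```

Averaging over windows gives empirical counterparts of the expected identification rate and of the exact recovery probability; a single missed hitter lowers the former only slightly but counts as a complete failure for the latter.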

(a) Frequency, ident. rate
(b) Payload, ident. rate
(c) Frequency, exact metric
(d) Payload, exact metric
Fig. 3: Identification accuracy results (window=100K).

As Fig. 3 depicts, hash-thinned Boyer-Moore and Max-Count perform remarkably well. The window of records we used is 100K (i.e., we report the culprits every 100K Netflow records), which corresponds (for the stream studied) on the RTS to slots of minutes. The performance of the SHP deteriorates as the number of top heavy hitters sought increases, which is an expected outcome of the birthday problem discussed above. Therefore, SHP is well-suited when stringent memory constraints are imposed and only the top hitter is of interest. Fig. 4 provides a sensitivity analysis w.r.t. the window size, which one could perform to find the optimal monitoring window size.

(a) Frequency, top heavy-hitter
(b) Frequency top-2 hitters
(c) Frequency top-3 hitters
Fig. 4: Identification accuracy (using the exact criterion) with varying window.

Table II shows the estimation accuracy of our methods. We juxtapose the actual traffic volume of the heavy-hitter with an estimate obtained from our sketches. For Algorithms 1–2, we estimate the volume by averaging the 4 selected octet bins of the hash arrays. For thinned Boyer-Moore, we report the value of the volume array. Indeed, Algorithm 1’s collisions are inevitable, but for the other two, collisions are spread nicely, and the estimated signal closely approximates the true value (here we round to the closest digit). More sophisticated estimates (e.g., see [11], which uses a median-based estimate to boost confidence) can also be considered.

Time 1 2 3 4 5 6 7 8
Simple 88 88 88 87 93 92 91 62
Max-Count 100 100 100 100 100 100 100 100
Boyer-Moore 100 100 100 100 100 100 100 100
TABLE II: Estimation accuracy (%) of proposed algorithms for top hitter (in bytes) compared to the exact value (window size = 500K).

Table III illustrates the small memory requirements of our methods. The Max-Count and Boyer-Moore-based algorithms consume considerably less than 1MB of fast memory, which remains constant. On the other hand, the naïve approach of using a dynamically varying hash-table has an increasing memory requirement. To showcase the memory increase the algorithm was run against a stream with an hourly total byte volume of gigabytes. As demonstrated in Table III, as the monitoring window grows (or equivalently, as the amount of traffic increases), the memory of the exact approach keeps growing (from 0.8 MB to 13 MB in this experiment), whereas the memory of the sketches does not change.

Window Size ( 50K) 1 2–7 8–19 20
Exact .8 3 6 13
Boyer-Moore (==256) .5 .5 .5 .5
Max-Count (=4, =256, =50) .4 .4 .4 .4
TABLE III: Memory Utilization (in MB)

Max-Stable sketch case studies. We illustrate Algorithm 4’s accuracy by looking at set-valued signals, tailored for detecting: (i) host scanning, (ii) port scanning, and (iii) a signal based on time to live (TTL) values.

Fig. 5(a) shows the accuracy on host scanning, where the goal is to identify the source IPs that contact a large number of destination IPs. A filled circle on the dotted blue line marks a window in which we identify the heavy-hitter IP exactly. The solid gray line shows the exact number of different hosts that the malicious scanner has contacted. The dotted blue line shows our algorithm’s estimate of the set cardinality, obtained by simply averaging, as described above. The identification accuracy is very high, and the misses occur when the set size is quite small.

In Fig. 5(b), our signal updates involve the set of ports that a given source IP interacts with; hence, this signal could be used to identify port scanners. While the accuracy is still quite high, we observe a few mis-identifications. This is because the number of unique ports () is much smaller than the number of unique IPs (). As a result, collisions in our hash arrays matter more in this case; background “noise” from non-malicious IPs could alter the ranking of even one of the buckets selected in the identification step and result in mis-identification.

(a) Host scanning
(b) Port scanning
(c) IP spoofing
Fig. 5: Evaluation of Algorithm 4 on the Netflow stream (window=100K).

In Fig. 5(c), we apply the max-stable sketch to Darknet data (also known as Internet background radiation) [26]. The available dataset does not include Netflow records, but instead consists of packets captured at a network interface (this 1-hour long trace has million packets). Darknets are composed of traffic destined to unallocated address spaces (i.e., dark spaces); such traffic is therefore directly associated with malicious acts or misconfigurations. Algorithm 4 is used to identify packets whose source IP has been spoofed. To achieve this, we look at the TTL value of each packet. For the problem at hand, a heavy-hitter is a source IP whose set of unique TTL values in the monitoring window is large, an indication of spoofing [27]. The evaluation results show that the culprit is correctly identified in around 84% of the cases, and the reason for the deteriorating performance is again the small cardinality of the set of unique TTL values (which is in theory, but in practice fewer unique TTL values are encountered).
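
For reference, the exact (non-sketched) version of the TTL signal can be written in a few lines. The code below is only the ground-truth baseline that a sketch would approximate, not Algorithm 4 itself; the function name `ttl_heavy_hitter`, the `(src_ip, ttl)` record layout, and the sample addresses are hypothetical.

```python
from collections import defaultdict


def ttl_heavy_hitter(packets):
    """Return (src_ip, distinct_ttl_count) for the source with the most distinct TTLs.

    `packets` is an iterable of (src_ip, ttl) pairs from one monitoring window.
    Returns None for an empty window.
    """
    ttls_by_source = defaultdict(set)
    for src_ip, ttl in packets:
        ttls_by_source[src_ip].add(ttl)
    if not ttls_by_source:
        return None
    src_ip, ttls = max(ttls_by_source.items(), key=lambda item: len(item[1]))
    return src_ip, len(ttls)


# Hypothetical window: one source appears with many different TTLs,
# while a high-volume source always uses the same TTL.
window = [("198.51.100.7", ttl) for ttl in (64, 63, 128, 127, 255, 52)]
window += [("203.0.113.9", 64)] * 50
result = ttl_heavy_hitter(window)
```

Note that this baseline keeps one set per source IP, so its memory grows with the number of distinct sources in the window, which is the cost a constant-size sketch is meant to avoid.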

VII Conclusions

We presented a family of algorithms that are well-suited for online implementation in fast network streams. In addition, our framework can be employed to find heavy-hitters on a variety of signals, including complex ones that involve operations with sets. Further, our algorithms are amenable to distributed implementation: in a decentralized setting, the proposed data structures could be constructed at various observation points (i.e., switches/routers) and then transmitted for aggregation at centralized decision centers. Two properties make this possible: (i) the data structures are small and constant in size and, hence, can be efficiently transmitted over the network to a centralized location; (ii) they can be linearly combined/aggregated by the centralized worker and reduced to a single sketch object that yields the final culprits.
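
The linear-aggregation property can be illustrated with a minimal sketch, assuming each observation point keeps an additive array of byte counters built with the same hash function. The names `bucket`, `build_sketch`, and `merge`, the BLAKE2 hashing, and the `width` parameter are hypothetical and not taken from the paper.

```python
import hashlib


def bucket(key: str, width: int) -> int:
    """Shared hash function: every observation point must use the same one."""
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % width


def build_sketch(records, width=256):
    """Build a constant-size array of byte counters from (key, nbytes) records."""
    counters = [0] * width
    for key, nbytes in records:
        counters[bucket(key, width)] += nbytes
    return counters


def merge(sketch_a, sketch_b):
    """Combine two sketches by element-wise addition."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same size")
    return [a + b for a, b in zip(sketch_a, sketch_b)]


# Hypothetical records seen at two different routers.
router_1 = [("10.0.0.1", 1500), ("10.0.0.2", 40)]
router_2 = [("10.0.0.1", 900), ("10.0.0.3", 60)]
merged = merge(build_sketch(router_1), build_sketch(router_2))
```

Under these assumptions, the merged array equals the sketch that a single monitor would have built from the combined stream, so only the fixed-size arrays need to cross the network. A sketch that is not an additive counter array would need its own combination rule, which this example does not cover.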

Appendix A Proofs

A-A Exact recovery guarantees

The proof of Theorem 1 follows.


For each hash function , let be the event that the top- heavy hitters are hashed into different bins, and at the same time, the remaining hitters are hashed into the remaining bins. That is, the bins of the top- hitters involve no collisions. By the independence of , we have


If the event occurs, then the top- heavy hitters will be correctly identified. Thus, the independence of the events in implies that

which by (1) yields the first bound. The second bound follows from the product comparison inequality

valid for all . Indeed, setting and , we get

which gives the second bound. ∎
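
The collision-free event used in this proof can be checked numerically for a single hash function. The sketch below assumes n distinct keys hashed independently and uniformly into m bins, with the first k keys playing the role of the top hitters; the symbols n, m, and k and the function names are assumptions made for illustration, since the paper's own notation is not reproduced here.

```python
import random


def no_collision_probability(n: int, m: int, k: int) -> float:
    """P(top-k keys occupy k distinct bins and no other key shares those bins)."""
    if not 1 <= k <= min(n, m):
        raise ValueError("require 1 <= k <= min(n, m)")
    p_top_distinct = 1.0
    for i in range(k):
        p_top_distinct *= (m - i) / m  # i-th top key avoids the i bins already taken
    p_rest_avoid = (1 - k / m) ** (n - k)  # each remaining key misses the k top bins
    return p_top_distinct * p_rest_avoid


def simulate(n: int, m: int, k: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        bins = [rng.randrange(m) for _ in range(n)]
        top = set(bins[:k])
        if len(top) == k and all(b not in top for b in bins[k:]):
            hits += 1
    return hits / trials


closed_form = no_collision_probability(n=3, m=4, k=2)
estimate = simulate(n=3, m=4, k=2)
```

For n = 3, m = 4, k = 2 the closed form is (4·3/16)·(1/2) = 0.375, and the simulated frequency should land close to it. The probability shrinks quickly as k grows for fixed m, which is the birthday-problem effect.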

For Corollary 1, we have:


The result follows by observing that hash-thinning with an independent uniform hash function taking values leads to bins in (8). ∎

A-B Bounds on the rate of identification

The proof of Theorem 2 is as follows.


The assumption guarantees that equals the number of distinct values in the set of hashes . Let if bin is occupied and otherwise, for Observe that


by exchangeability. Note, however, that

since are independent and Uniform.

This, in view of (6) and (9), implies the first relation in (2). The second follows from the standard approximation
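
The occupancy computation behind this proof can also be checked numerically. The sketch below relies on the standard fact that, when n independent uniform hashes fall into m bins, a given bin is empty with probability (1 − 1/m)^n, which is approximately e^{−n/m}. The symbols n and m and the function names are assumptions for illustration and need not match the paper's notation.

```python
import math
import random


def expected_occupied(n: int, m: int) -> float:
    """Exact expected number of occupied bins for n uniform hashes into m bins."""
    return m * (1 - (1 - 1 / m) ** n)


def expected_occupied_approx(n: int, m: int) -> float:
    """Same quantity with (1 - 1/m)^n replaced by exp(-n/m)."""
    return m * (1 - math.exp(-n / m))


def simulate_occupied(n: int, m: int, trials: int = 2000, seed: int = 0) -> float:
    """Average number of distinct bins hit over repeated random trials."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len({rng.randrange(m) for _ in range(n)})
    return total / trials


exact = expected_occupied(100, 256)
approx = expected_occupied_approx(100, 256)
simulated = simulate_occupied(100, 256)
```

With 100 keys and 256 bins, the exact and approximate expectations differ by a small fraction of a bin, and the simulated average should agree with both.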


  1. We used a programmers’ syntax for variable updates.
  2. We focus on single IPs. Everything we say here, though, applies to (src, dst) pairs and to tuples of the kind (src, sport, dst, dport), where sport/dport are the source and destination ports.


  1. C. Rossow, “Amplification Hell: Revisiting Network Protocols for DDoS Abuse,” in 2014 Network and Distributed System Security Symposium, NDSS, San Diego, CA, 2014.
  2. A. Lakhina, M. Crovella, and C. Diot, “Diagnosing network-wide traffic anomalies,” SIGCOMM Comput. Commun. Rev., vol. 34, pp. 219–230, August 2004.
  3. P. Barford, J. Kline, D. Plonka, and A. Ron, “A signal analysis of network traffic anomalies,” in 2nd ACM SIGCOMM Workshop on Internet measurement, NY, USA, 2002, pp. 71–82.
  4. T. Idé, “Eigenspace-based anomaly detection in computer systems,” in 10th ACM SIGKDD (KDD), 2004, pp. 440–449.
  5. J. Zhang, J. Rexford, and J. Feigenbaum, “Learning-based anomaly detection in BGP updates,” in 2005 ACM SIGCOMM MineNet, NY, USA, 2005, pp. 219–220.
  6. M. Thottan, G. Liu, and C. Ji, “Anomaly detection approaches for communication networks,” in Algorithms for Next Generation Networks, 2010, pp. 239–261.
  7. G. Cormode and S. Muthukrishnan, “An improved data stream summary: the count-min sketch and its applications,” J. Algorithms, vol. 55, no. 1, pp. 58–75, Apr. 2005.
  8. G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava, “Finding hierarchical heavy hitters in data streams,” in VLDB ’03, 2003, pp. 464–475.
  9. G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, O. Spatscheck, and D. Srivastava, “Holistic UDAFs at streaming speeds,” in ACM SIGMOD, NY, USA, 2004, pp. 35–46.
  10. A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin, “Algorithmic linear dimension reduction in the ℓ1 norm for sparse vectors,” in Allerton 2006, 2006.
  11. B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen, “Sketch-based change detection: methods, evaluation, and applications,” in 3rd ACM SIGCOMM IMC, NY, USA, 2003, pp. 234–247.
  12. R. Schweller, Z. Li, Y. Chen, Y. Gao, A. Gupta, Y. Zhang, P. Dinda, M.-Y. Kao, and G. Memik, “Reverse hashing for high-speed network monitoring: Algorithms, evaluation, and applications,” in INFOCOM 2006, 2006, pp. 1–12.
  13. G. Cormode and S. Muthukrishnan, “What’s new: finding significant differences in network data streams,” in INFOCOM 2004, vol. 3, 2004, pp. 1534–1545.
  14. Y. Gao, Z. Li, and Y. Chen, “A DoS Resilient Flow-level Intrusion Detection Approach for High-speed Networks,” in International Conference on Distributed Computing Systems, 2006, pp. 39–39.
  15. G. Cormode and S. Muthukrishnan, “What’s hot and what’s not: Tracking most frequent items dynamically,” ACM Trans. Database Syst., vol. 30, no. 1, pp. 249–278, Mar. 2005.
  16. E. Porat and M. J. Strauss, “Sublinear time, measurement-optimal, sparse recovery for all,” in ACM-SIAM SODA, 2012, pp. 1215–1227.
  17. A. Gilbert, Y. Li, E. Porat, and M. Strauss, “Approximate sparse recovery: Optimizing time and measurements,” SIAM Journal on Computing, vol. 41, no. 2, pp. 436–453, 2012.
  18. S. Muthukrishnan, “Data streams: Algorithms and applications,” Found. Trends Theor. CS, vol. 1, no. 2, pp. 117–236, Aug. 2005.
  19. J. Tropp and A. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” Information Theory, IEEE Transactions on, vol. 53, no. 12, pp. 4655–4666, Dec 2007.
  20. A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin, “One sketch for all: Fast algorithms for compressed sensing,” in STOC ’07, NY, USA, 2007, pp. 237–246.
  21. D. Donoho, “Compressed sensing,” Information Theory, IEEE Transactions on, vol. 52, no. 4, pp. 1289–1306, April 2006.
  22. P. Indyk, “Explicit constructions for compressed sensing of sparse signals,” in 19-th ACM-SIAM SODA. PA, USA: Society for Industrial and Applied Mathematics, 2008, pp. 30–33.
  23. S. Halevi, “Invertible universal hashing and the TET encryption mode,” in Advances in Cryptology - CRYPTO 2007, ser. Lecture Notes in CS, A. Menezes, Ed., 2007, vol. 4622, pp. 412–429.
  24. R. Boyer and J. Moore, “MJRTY - a fast majority vote algorithm,” in Automated Reasoning, ser. Automated Reasoning Series, R. Boyer, Ed., 1991, vol. 1, pp. 105–117.
  25. S. Stoev, M. Hadjieleftheriou, G. Kollios, and M. Taqqu, “Norm, point, and distance estimation over multiple signals using max-stable distributions,” in IEEE 23rd International Conference on Data Engineering, April 2007, pp. 1006–1015.
  26. E. Wustrow, M. Karir, M. Bailey, F. Jahanian, and G. Houston, “Internet Background Radiation Revisited,” in 10th ACM SIGCOMM IMC, Melbourne, Australia, November 2010.
  27. R. Beverly and S. Bauer, “The spoofer project: Inferring the extent of source address filtering on the internet,” in SRUTI’05. Berkeley, CA, USA: USENIX Association, 2005, pp. 8–8.