Utilizing Dynamic Properties of Sharing Bits and Registers to Estimate User Cardinalities over Time
*Peng Jia, Jing Tao and Xiaohong Guan are corresponding authors.


Pinghui Wang, Peng Jia, Xiangliang Zhang, Jing Tao, Xiaohong Guan, Don Towsley
MOE Key Laboratory for Intelligent Networks and Network Security, Xi’an Jiaotong University, China
Shenzhen Research Institute of Xi’an Jiaotong University, Shenzhen, China
King Abdullah University of Science and Technology, Thuwal, SA
Zhejiang Research Institute of Xi’an Jiaotong University, Hangzhou, China
Department of Automation and NLIST Lab, Tsinghua University, Beijing, China
School of Computer Science, University of Massachusetts Amherst, MA, USA
Email: {phwang, jtao, xhguan, pengjia}@sei.xjtu.edu.cn, xiangliang.zhang@kaust.edu.sa,
towsley@cs.umass.edu
Abstract

Online monitoring of user cardinalities (or degrees) in graph streams is fundamental for many applications. For example, in a bipartite graph representing user-website visiting activities, user cardinalities (the number of distinct visited websites) are monitored to report network anomalies. These real-world graph streams may contain user-item duplicates and have a huge number of distinct user-item pairs; therefore, it is infeasible to compute user cardinalities exactly when memory and computation resources are limited. Existing methods estimate user cardinalities approximately, and their accuracy depends heavily on parameters that are not easy to set. Moreover, these methods cannot provide anytime-available estimation, as the user cardinalities are computed at the end of the data stream. Real-time applications such as anomaly detection require that user cardinalities be estimated on the fly. To address these problems, we develop novel bit and register sharing algorithms, which use a bit array and a register array, respectively, to build a compact sketch of all users’ connected items. Compared with previous bit and register sharing methods, our algorithms exploit the dynamic properties of the bit and register arrays (e.g., the fraction of zero bits in the bit array at each time) to significantly improve the estimation accuracy, and have low time complexity ($O(1)$) to update the estimations each time they observe a new user-item pair. In addition, our algorithms are simple and easy to use, and require no parameter tuning. We evaluate the performance of our methods on real-world datasets. The experimental results demonstrate that our methods are several times more accurate and faster than state-of-the-art methods using the same amount of memory.

I Introduction

Many real-world networks are given in the form of graph streams. A calling network is one such example, with nodes representing users and an edge representing a call from one user to another. When web surfing activities are modeled as a bipartite graph stream where users and items refer to network hosts and websites respectively, an edge represents a visit by a user to a website. Monitoring the cardinalities (or degrees) of users in these networks is fundamental for many applications such as network anomaly detection [17, 54, 36, 51], where a user’s cardinality is defined to be the number of distinct users/items that the user connects to in the regular/bipartite graph stream of interest. Due to the large size and high speed of these graph streams, it is infeasible to collect the entire graph, especially when computation and memory resources are limited. For example, network routers have fast but very small memories, which leaves their traffic monitoring modules unable to compute the cardinalities of network users exactly. Therefore, it is important to develop fast and memory-efficient algorithms to approximately compute user cardinalities over time.

Compared with using a counter to record a user’s frequency (i.e., the number of times the user occurred) over time, one needs to build a hash table of distinct occurred edges to handle edge duplicates in graph streams when computing user cardinalities. Therefore, computing user cardinalities is more complex and difficult than computing user frequencies for large data streams, and frequency estimation methods such as Count-Min sketch [10] fail to approximate user cardinalities. To address this challenge, a variety of cardinality estimation methods such as Linear-Time Probabilistic Counting (LPC) [46] and HyperLogLog (HLL) [19] have been developed to approximately compute cardinalities. An LPC/HLL sketch consists of $m$ bits/registers, where $m$ is a parameter affecting the estimation accuracy. Since user cardinalities are not known in advance and change over time, one needs to set $m$ large (e.g., in the thousands) to achieve reasonable accuracy for each user, whose cardinality may vary over a large range. However, this method wastes considerable memory because a large value of $m$ is not necessary for most users, which have small cardinalities. To solve this problem, [54, 50, 43, 47] develop different virtual sketch methods to compress each user’s LPC/HLL sketch into a large bit/register array shared by all users. These virtual sketch methods build each user’s virtual LPC/HLL sketch using $m$ bits/registers randomly selected from the large bit/register array. This significantly reduces memory usage because each bit/register may be used by more than one user. However, bits/registers in a user’s virtual LPC/HLL sketch may be contaminated by other users. We refer to these bits/registers as “noisy” bits/registers. In practice, most users have small cardinalities and their virtual LPC/HLL sketches tend to contain many “noisy” bits/registers, which results in large estimation errors.
Another limitation of existing methods is that they are unable to report user cardinalities on the fly, because they are customized to estimate user cardinalities after all the data has been observed. For real-time applications like on-line anomaly detection, it is important to track user cardinalities in real-time. For example, network monitoring systems are required to detect abnormal IP addresses such as super spreaders (i.e., IP addresses with cardinalities larger than a specified threshold) on the fly. Moreover, online monitoring of IP address cardinalities over time also facilitates online detection of stealthy attacks launched from a subclass of IP addresses.

To address the above challenges, we develop two novel streaming algorithms, FreeBS and FreeRS, to accurately estimate user cardinalities over time. We summarize our main contributions as follows:
Compared with previous bit and register sharing methods, our algorithms FreeBS and FreeRS exploit the dynamic properties of the bit and register arrays over time (e.g., the fraction of zero bits in the bit array at each time) to significantly improve the estimation accuracy. To be more specific, FreeBS/FreeRS allows the number of bits/registers used by a user to dynamically increase as its cardinality increases over time and each user can use all shared bits/registers, which results in more accurate user cardinality estimations.

Our algorithms report user cardinality estimations on the fly and allow user cardinalities to be tracked in real-time. The time complexity for updating user cardinality estimations each time a new user-item pair is observed is reduced from $O(m)$ in state-of-the-art methods CSE [50] and vHLL [47] to $O(1)$.

We evaluate the performance of our methods on real-world datasets. Experimental results demonstrate that our methods are orders of magnitude faster and up to 10,000 times more accurate than state-of-the-art methods using the same amount of memory.

The rest of this paper is organized as follows. The problem formulation is presented in Section II. Section III introduces preliminaries. Section IV presents our algorithms FreeBS and FreeRS. The performance evaluation and testing results are presented in Section V. Section VI summarizes related work. Concluding remarks then follow.

II Problem Formulation

To formally define our problem, we first introduce some notation. Let $\Gamma = e^{(1)} e^{(2)} \cdots$ be the graph stream of interest, consisting of a sequence of edges. Note that an edge in $\Gamma$ may appear more than once. In this paper, we focus on bipartite graph streams consisting of edges between users and items; our methods, however, extend easily to regular graphs. Let $S$ and $D$ denote the user and item sets, respectively. For $t = 1, 2, \ldots$, let $e^{(t)} = (s^{(t)}, d^{(t)})$ denote the $t$-th edge of $\Gamma$, where $s^{(t)} \in S$ and $d^{(t)} \in D$ are the user and the item of $e^{(t)}$, respectively. Let $N_s^{(t)}$ denote the set of distinct items that user $s$ connects to before and including time $t$, with $N_s^{(0)} = \emptyset$. Define $n_s^{(t)} = |N_s^{(t)}|$ to be the cardinality of user $s$ at time $t$. Then, $n^{(t)} = \sum_{s \in S} n_s^{(t)}$ is the sum of all user cardinalities at time $t$. In this paper, we develop fast and accurate methods for estimating user cardinalities at times $t = 1, 2, \ldots$ using a limited amount of memory. When no confusion arises, we omit the superscript $(t)$ to ease exposition.
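As a point of reference for the definitions above, the following minimal Python sketch computes the exact cardinality of each user at every time step by storing every distinct user-item pair. It is the memory-hungry baseline that sketch-based methods try to avoid; the function name `exact_cardinalities` and the toy stream are illustrative assumptions, not part of the original text.

```python
from collections import defaultdict


def exact_cardinalities(stream):
    """Yield (t, user, exact cardinality of user at time t) for each edge.

    Memory grows with the number of distinct user-item pairs, which is
    what makes this approach infeasible for large, fast streams.
    """
    items_of = defaultdict(set)  # user s -> set of distinct items seen so far
    for t, (s, d) in enumerate(stream, start=1):
        items_of[s].add(d)       # a duplicate pair leaves the set unchanged
        yield t, s, len(items_of[s])


# Toy stream with one duplicate edge, ("u1", "a").
stream = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u1", "a"), ("u2", "c")]
trace = list(exact_cardinalities(stream))
```

The duplicate edge at time 4 does not increase the cardinality of `u1`, which is the behavior every estimator in the rest of the paper has to approximate.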

III Preliminaries

Figure 1: Overview of (a) the bit sharing method CSE and (b) the register sharing method vHLL. Virtual CSE/vHLL sketches of users may contain “noisy” bits/registers (e.g., the bit and register in red and bold in the figure).

III-A Estimating a Single User’s Cardinality

III-A1 Linear-Time Probabilistic Counting

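For a user $s$, Linear-Time Probabilistic Counting (LPC) [46] builds a sketch $B_s$ to store the set of items that $s$ connects to, i.e., $N_s^{(t)}$. Formally, $B_s$ is defined as a bit array consisting of $m$ bits, which are initialized to zero. Let $h(d)$ be a uniform hash function with range $\{1, \ldots, m\}$. When user-item pair $(s, d)$ arrives, the $h(d)$-th bit in $B_s$ is set to one, i.e., $B_s[h(d)] \leftarrow 1$. For any bit $B_s[i]$, $1 \le i \le m$, the probability that it remains zero at time $t$ is $(1 - 1/m)^{n_s^{(t)}}$. Denote by $U_s^{(t)}$ the number of zero bits in $B_s$ at time $t$. Then, the expectation of $U_s^{(t)}$ is computed as $E(U_s^{(t)}) = m (1 - 1/m)^{n_s^{(t)}} \approx m e^{-n_s^{(t)}/m}$. Based on the above equation, when $U_s^{(t)} > 0$, Whang et al. [46] estimate $n_s^{(t)}$ as

$$\hat{n}_s^{(t)} = -m \ln \frac{U_s^{(t)}}{m}.$$

The range of $\hat{n}_s^{(t)}$ is $[0, m \ln m]$, and its expectation and variance are computed as

$$E(\hat{n}_s^{(t)}) \approx n_s^{(t)} + \frac{1}{2}\left(e^{n_s^{(t)}/m} - \frac{n_s^{(t)}}{m} - 1\right), \qquad \mathrm{Var}(\hat{n}_s^{(t)}) \approx m\left(e^{n_s^{(t)}/m} - \frac{n_s^{(t)}}{m} - 1\right).$$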
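The LPC estimator above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: `hash64` (a BLAKE2b-based stand-in for the uniform hash $h$), the class name `LPC`, and the 0-based bit indices are all choices made here for the example.

```python
import hashlib
import math


def hash64(key):
    """Deterministic 64-bit hash of a key (assumed stand-in for a uniform hash)."""
    digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class LPC:
    """Linear-time probabilistic counting sketch for a single user."""

    def __init__(self, m):
        self.m = m
        self.bits = bytearray(m)  # B_s, all zero initially

    def add(self, item):
        # Set bit h(d); indices are 0-based here, 1-based in the text.
        self.bits[hash64(item) % self.m] = 1

    def estimate(self):
        zeros = self.m - sum(self.bits)  # U_s: number of zero bits
        if zeros == 0:
            raise ValueError("sketch is saturated; the estimator needs U_s > 0")
        return -self.m * math.log(zeros / self.m)


sketch = LPC(1024)
for i in range(300):
    sketch.add(("item", i))
estimate_300 = sketch.estimate()
```

Re-adding the same items sets bits that are already one, so duplicates leave the estimate unchanged.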

III-A2 HyperLogLog

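To estimate the cardinality of user $s$, HyperLogLog (HLL) [19] is developed based on the Flajolet-Martin (FM) sketch [20], consisting of $m$ registers $R_s[1], \ldots, R_s[m]$. All $m$ registers are initialized to $0$. For $1 \le i \le m$, let $R_s^{(t)}[i]$ be the value of $R_s[i]$ at time $t$. When a user-item pair $(s, d)$ arrives, HLL maps the item $d$ into a pair of random variables $h(d)$ and $\rho(d)$, where $h(d)$ is an integer uniformly selected from $\{1, \ldots, m\}$ at random and $\rho(d)$ is drawn from a $Geometric(1/2)$ distribution, $P(\rho(d) = k) = 1/2^k$ for $k = 1, 2, \ldots$. (Footnote 1: Functions $h(d)$ and $\rho(d)$ are usually implemented as follows. Let $\Phi(d) = (b_1 b_2 \cdots)$ be the binary format of the output of a hash function $\Phi(d)$, and $b = \log_2 m$. Then, $h(d)$ is defined as $h(d) = (b_1 b_2 \cdots b_b)_2 + 1$ and $\rho(d)$ is defined as the number of leading zeros in $(b_{b+1} b_{b+2} \cdots)$ plus one.) Then, register $R_s[h(d)]$ is updated as

$$R_s^{(t)}[h(d)] \leftarrow \max\left\{R_s^{(t-1)}[h(d)],\ \rho(d)\right\}.$$

At time $t$, Flajolet et al. [19] estimate $n_s^{(t)}$ as

$$\hat{n}_s^{(t)} = \frac{\alpha_m m^2}{\sum_{i=1}^{m} 2^{-R_s^{(t)}[i]}},$$

where $\alpha_m$ is the following constant to correct for bias

$$\alpha_m = \left(m \int_0^{\infty} \left(\log_2 \frac{2 + x}{1 + x}\right)^m dx\right)^{-1}.$$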

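The above formula for $\alpha_m$ is complicated. In practice, $\alpha_m$ is computed numerically, e.g., $\alpha_{16} \approx 0.673$, $\alpha_{32} \approx 0.697$, $\alpha_{64} \approx 0.709$, and $\alpha_m \approx 0.7213/(1 + 1.079/m)$ for $m \ge 128$. The error of $\hat{n}_s^{(t)}$ is analyzed as follows: as $n_s^{(t)} \to \infty$,

$$\frac{E(\hat{n}_s^{(t)})}{n_s^{(t)}} = 1 + \delta_1(n_s^{(t)}) + o(1), \qquad \frac{\sqrt{\mathrm{Var}(\hat{n}_s^{(t)})}}{n_s^{(t)}} = \frac{\beta_m}{\sqrt{m}} + \delta_2(n_s^{(t)}) + o(1),$$

where $\delta_1$ and $\delta_2$ represent oscillating functions of a tiny amplitude (e.g., $|\delta_1(n_s^{(t)})| < 5 \times 10^{-5}$ and $|\delta_2(n_s^{(t)})| < 5 \times 10^{-4}$ as soon as $m \ge 16$) which can be safely neglected for all practical purposes, and $\beta_m$ is a constant for a specific $m$, e.g., $\beta_{16} \approx 1.106$, $\beta_{32} \approx 1.070$, $\beta_{64} \approx 1.054$, $\beta_{128} \approx 1.046$, and $\beta_{\infty} \approx 1.039$.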

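Since $\hat{n}_s^{(t)}$ is severely biased for small cardinalities, HLL treats the HLL sketch as an LPC sketch (i.e., a bitmap of $m$ bits) when $\hat{n}_s^{(t)} < 2.5m$. In this case, $n_s^{(t)}$ is estimated as $-m \ln(\tilde{U}_s^{(t)}/m)$, where $\tilde{U}_s^{(t)}$ is the number of registers among $R_s[i]$, $1 \le i \le m$, that equal 0 at time $t$. Since an HLL register occupies several bits whereas an LPC sketch spends a single bit per position, we easily find that LPC outperforms HLL for small cardinalities under the same memory usage.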
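The HLL update and estimate, including the fallback to the LPC formula for small cardinalities, can be sketched as follows. The helper `hash64`, the class name `HLL`, the 64-bit hash width and the 0-based register indices are assumptions made for this illustration; the constants $\alpha_m$ are the numerical values quoted above.

```python
import hashlib
import math


def hash64(key):
    """Deterministic 64-bit hash of a key (assumed stand-in for a uniform hash)."""
    digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class HLL:
    """HyperLogLog sketch with m = 2**b registers for a single user."""

    def __init__(self, b):
        if b < 4:
            raise ValueError("alpha_m below is only tabulated for m >= 16")
        self.b = b
        self.m = 1 << b
        self.reg = [0] * self.m

    def add(self, item):
        x = hash64(item)
        idx = x >> (64 - self.b)                       # h(d): first b bits (0-based)
        rest = x & ((1 << (64 - self.b)) - 1)          # remaining 64 - b bits
        rho = (64 - self.b) - rest.bit_length() + 1    # leading zeros of rest, plus one
        if rho > self.reg[idx]:
            self.reg[idx] = rho

    def alpha(self):
        table = {16: 0.673, 32: 0.697, 64: 0.709}
        return table.get(self.m, 0.7213 / (1 + 1.079 / self.m))

    def estimate(self):
        m = self.m
        raw = self.alpha() * m * m / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if raw < 2.5 * m and zeros > 0:
            return -m * math.log(zeros / m)            # small range: treat as LPC
        return raw


hll = HLL(8)  # m = 256 registers
for i in range(5000):
    hll.add(("item", i))
estimate_5000 = hll.estimate()
```

With $m = 256$ the relative standard error is roughly $1.04/\sqrt{m} \approx 6.5\%$, so individual estimates can be off by a few hundred at this cardinality.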

III-A3 Discussions

To compute all user cardinalities, one can use an LPC or HLL sketch to estimate each user cardinality. Clearly, using a small $m$, LPC and HLL will exhibit large errors for users with large cardinalities. Most user cardinalities are small, and assigning an LPC/HLL sketch with large $m$ to each user in order to accurately estimate large user cardinalities is wasteful, as LPC and HLL do not need a large $m$ to achieve reasonable estimation accuracy for users with small cardinalities. In practice, the user cardinalities are not known in advance and vary over time. Therefore, it is difficult to set an optimal value of $m$ when using LPC and HLL to estimate all user cardinalities. In the next subsection, we introduce state-of-the-art methods to address this problem, and also discuss their shortcomings.

III-B Estimating All User Cardinalities

III-B1 CSE: Compressing LPC Sketches of All Users into a Shared Bit Array

As shown in Figure 1 (a), CSE [50] consists of a large bit array $A$ and $m$ independent hash functions $f_1(s), \ldots, f_m(s)$, each mapping users to $\{1, \ldots, M\}$, where $M$ is the length of the one-dimensional bit array $A$. Similar to LPC, CSE builds a virtual LPC sketch for each user and embeds the LPC sketches of all users in $A$. For user $s$, its virtual LPC sketch $\hat{B}_s$ consists of $m$ bits selected randomly from $A$ by the group of hash functions $f_1(s), \ldots, f_m(s)$, that is

$$\hat{B}_s = \left(A[f_1(s)], \ldots, A[f_m(s)]\right).$$

Each bit in $A$ is initially set to zero. When a user-item pair $(s, d)$ arrives, CSE sets the $h(d)$-th bit in $\hat{B}_s$ to one. Similar to LPC, $h(d)$ is a uniform hash function with range $\{1, \ldots, m\}$. Since the $h(d)$-th element in $\hat{B}_s$ is bit $A[f_{h(d)}(s)]$, CSE only needs to set bit $A[f_{h(d)}(s)]$, i.e., $A[f_{h(d)}(s)] \leftarrow 1$. Let $\hat{U}_s^{(t)}$ be the number of zero bits in $\hat{B}_s$ and $U^{(t)}$ be the number of zero bits in $A$ at time $t$. A user’s virtual LPC sketch can be viewed as a regular LPC sketch containing “noisy” bits (e.g., the bit in red and bold in Figure 1 (a)) that are wrongly set from zero to one by items of other users. To remove the estimation error introduced by “noisy” bits, Yoon et al. [50] estimate $n_s^{(t)}$ as

$$\hat{n}_s^{(t)} = -m \ln \frac{\hat{U}_s^{(t)}}{m} + m \ln \frac{U^{(t)}}{M}.$$

On the right-hand side of the above equation, the first term is the same as the regular LPC, and the second term corrects the error introduced by “noisy” bits. The bias and variance of $\hat{n}_s^{(t)}$ are given by eqs. (23) and (24) in the original paper [50].
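The bit sharing scheme of CSE described above can be illustrated with the following Python sketch. It is a simplified rendering under assumptions made here: the $m$ hash functions $f_i(s)$ are simulated by hashing the tuple `("f", i, user)` with the hypothetical helper `hash64`, indices are 0-based, and the count of zero bits in $A$ is maintained incrementally.

```python
import hashlib
import math


def hash64(key):
    """Deterministic 64-bit hash of a key (assumed stand-in for a uniform hash)."""
    digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class CSE:
    """Shared bit array A of length M; each user owns m virtual bits."""

    def __init__(self, M, m):
        self.M, self.m = M, m
        self.A = bytearray(M)
        self.zeros = M                                  # U: zero bits in A

    def _pos(self, user, i):
        return hash64(("f", i, user)) % self.M          # f_i(s)

    def add(self, user, item):
        i = hash64(("h", item)) % self.m                # h(d): virtual bit index
        p = self._pos(user, i)
        if self.A[p] == 0:
            self.A[p] = 1
            self.zeros -= 1

    def estimate(self, user):
        # Scans the user's m virtual bits, hence O(m) per query.
        u_s = sum(1 for i in range(self.m) if self.A[self._pos(user, i)] == 0)
        if u_s == 0 or self.zeros == 0:
            raise ValueError("sketch is saturated")
        m = self.m
        return -m * math.log(u_s / m) + m * math.log(self.zeros / self.M)


cse = CSE(M=1 << 16, m=512)
for i in range(200):
    cse.add("heavy", ("item", i))
for k in range(2000):
    for i in range(5):
        cse.add(("light", k), ("item", i))
heavy_estimate = cse.estimate("heavy")
```

Each query touches all $m$ virtual bits of the user, which is the per-estimate cost that the later sections set out to avoid.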

III-B2 vHLL: Compressing HLL Sketches of All Users into a Shared Register Array

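Xiao et al. [47] develop a register sharing method, vHLL, which extends the HLL method to estimate all user cardinalities. vHLL consists of a list of $M$ registers $R[1], \ldots, R[M]$, which are initialized to zero. To maintain the virtual HLL sketch $\hat{R}_s$ of a user $s$, vHLL uses $m$ independent hash functions $f_1(s), \ldots, f_m(s)$ to randomly select $m$ registers from all registers $R[1], \ldots, R[M]$, where each function $f_i(s)$ maps users to $\{1, \ldots, M\}$. Formally, $\hat{R}_s$ is defined as

$$\hat{R}_s = \left(R[f_1(s)], \ldots, R[f_m(s)]\right).$$

For $1 \le i \le M$, let $R^{(t)}[i]$ be the value of $R[i]$ at time $t$. When a user-item pair $(s, d)$ arrives, vHLL maps the item $d$ to a pair of random variables $h(d)$ and $\rho(d)$, where $h(d)$ is an integer uniformly selected from $\{1, \ldots, m\}$ at random, and $\rho(d)$ is a random integer drawn from a $Geometric(1/2)$ distribution, which is similar to HLL. We can easily find that the $h(d)$-th element in the virtual HLL sketch of user $s$ is $R[f_{h(d)}(s)]$; therefore, vHLL only needs to update register $R[f_{h(d)}(s)]$ as

$$R^{(t)}[f_{h(d)}(s)] \leftarrow \max\left\{R^{(t-1)}[f_{h(d)}(s)],\ \rho(d)\right\}.$$

A user’s virtual HLL sketch can be viewed as a regular HLL sketch containing “noisy” registers (e.g., the register in red and bold in Figure 1 (b)), which are wrongly set by items of other users. To remove the estimation error introduced by “noisy” registers, Xiao et al. [47] estimate $n_s^{(t)}$ as

$$\hat{n}_s^{(t)} = \frac{M}{M - m}\left(\frac{\alpha_m m^2}{\sum_{i=1}^{m} 2^{-R^{(t)}[f_i(s)]}} - \frac{m\, \alpha_M M}{\sum_{i=1}^{M} 2^{-R^{(t)}[i]}}\right),$$

where $\alpha_m$ is the same as that of HLL. For the two terms between the parentheses on the right-hand side of the above equation, the first term is the same as the regular HLL, and the second term corrects the error introduced by “noisy” registers. Similar to the regular HLL, the first term is replaced by $-m \ln(\hat{U}_s^{(t)}/m)$ when $\alpha_m m^2 / \sum_{i=1}^{m} 2^{-R^{(t)}[f_i(s)]} < 2.5m$, where $\hat{U}_s^{(t)}$ is the number of registers among $R[f_i(s)]$, $1 \le i \le m$, that equal 0 at time $t$. The expectation of $\hat{n}_s^{(t)}$ approximately equals $n_s^{(t)}$, that is, $E(\hat{n}_s^{(t)}) \approx n_s^{(t)}$. An approximate expression for the variance of $\hat{n}_s^{(t)}$ is given in [47]; it involves $n^{(t)} = \sum_{s \in S} n_s^{(t)}$, which counts the total number of distinct user-item pairs that occurred before and including time $t$.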

III-C Unsolved Challenges

Challenge 1: It is difficult to set parameter $m$ for both CSE and vHLL. The estimation accuracy of CSE and vHLL depends heavily on the value of $m$, as the above discussions show. Increasing $m$ introduces more “unused” bits in the virtual LPC sketches of occurred users, which can become contaminated with noise. Here “unused” bits refer to the bits in a user’s virtual LPC sketch that no user-item pairs of the user are hashed into. However, decreasing $m$ introduces large estimation errors for users with large cardinalities. vHLL confronts the same challenge in determining an optimal $m$. Later, our experimental results will also verify that errors increase with $m$ for users with small cardinalities under CSE and vHLL.

Challenge 2: It is computationally intensive to estimate user cardinalities for all values of $t$. At each time, both CSE and vHLL require $O(m)$ time to compute the cardinality of a single user. When applied to compute cardinalities for all users in $S$ at all times, CSE and vHLL have to be called repeatedly and incur a high computational cost, which prohibits their application to high-speed streams in an on-line manner.

IV Our Methods

In this section, we present our streaming algorithms FreeBS and FreeRS for estimating user cardinalities over time. FreeBS and FreeRS are designed based on two novel bit sharing and register sharing techniques, respectively. The basic idea behind our methods can be summarized as follows: unlike CSE/vHLL, which map each user’s items into $m$ bits/registers, FreeBS/FreeRS randomly maps user-item pairs into all $M$ bits/registers in the bit/register array. Thus, users with larger cardinalities (i.e., connecting to a larger number of items) tend to use more bits/registers. For each user-item pair $e = (s, d)$ occurring at time $t$, we discard it when updating $e$ does not change any bit/register in the bit/register array shared by all users. Otherwise, $e$ is a new user-item pair that has not occurred before time $t$, and we increase the cardinality estimate of user $s$ by $1/q^{(t)}$, where $q^{(t)}$ is defined as the probability that a new user-item pair changes any bit/register in the bit/register array at time $t$. To some extent, the above procedure can be viewed as a sampling method in which each new user-item pair arriving at time $t$ is sampled with probability $q^{(t)}$ and each user’s cardinality is estimated using the Horvitz-Thompson estimator [25].

IV-A FreeBS: Parameter-Free Bit Sharing

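Data Structure. The pseudo-code for FreeBS is shown as Algorithm 1. FreeBS consists of a one-dimensional bit array $B$ of length $M$, where each bit $B[j]$, $1 \le j \le M$, is initialized to zero. In addition, FreeBS uses a hash function $h^*(e)$ to uniformly and independently map each user-item pair $e = (s, d)$ to an integer in $\{1, \ldots, M\}$ at random, i.e., $P(h^*(e) = i) = 1/M$, and $P(h^*(e) = i \wedge h^*(e') = i' \mid e \ne e') = 1/M^2$, $i, i' \in \{1, \ldots, M\}$. Note that $h^*(e)$ differs from the hash function $f_i(s)$ used by CSE that maps user $s$ to an integer in $\{1, \ldots, M\}$ at random.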

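$B[1, \ldots, M] \leftarrow [0, \ldots, 0]$;
$\hat{n}_s \leftarrow 0$, $s \in S$;
$q_B \leftarrow 1$;
foreach $e = (s, d)$ in $\Gamma$ do
       if $B[h^*(e)] = 0$ then
             $\hat{n}_s \leftarrow \hat{n}_s + 1/q_B$;
             $B[h^*(e)] \leftarrow 1$;
             $q_B \leftarrow q_B - 1/M$;
       end if
end foreach
Algorithm 1 The pseudo-code for FreeBS.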

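Update Procedure. When a user-item pair $e = (s, d)$ arrives at time $t$, FreeBS first computes a random variable $h^*(e)$, and then sets $B[h^*(e)]$ to one, i.e., $B[h^*(e)] \leftarrow 1$. Let $B^{(t)}[i]$ be the value of $B[i]$ at time $t$, and let $B_0^{(t)}$ denote the set of the indices corresponding to zero bits in $B$ at time $t$. Formally, $B_0^{(t)}$ is defined as $B_0^{(t)} = \{i : B^{(t)}[i] = 0,\ 1 \le i \le M\}$. Let $\hat{n}_s^{(t)}$ denote the cardinality estimate for user $s$ at time $t$. We initialize $\hat{n}_s^{(0)}$ to 0. Next, we describe how $\hat{n}_s^{(t)}$ is computed on-line. Let $m_0^{(t)} = |B_0^{(t)}|$ denote the number of zero bits in $B$ at time $t$. Let $q_B^{(t)}$ denote the probability that a new user-item pair arriving at time $t$ changes a bit in $B$ from 0 to 1. Formally, $q_B^{(t)}$ is defined as

$$q_B^{(t)} = \sum_{i \in B_0^{(t-1)}} P(h^*(e) = i) = \frac{m_0^{(t-1)}}{M}.$$

Let $\mathbf{1}(\mathbb{P})$ denote the indicator function that equals 1 when predicate $\mathbb{P}$ is true and 0 otherwise. Besides setting $B[h^*(e)]$ to 1 at time $t$ with the arrival of user-item pair $e = (s, d)$, we also update the cardinality estimate of user $s$ as

$$\hat{n}_s^{(t)} \leftarrow \hat{n}_s^{(t-1)} + \frac{\mathbf{1}\left(B^{(t-1)}[h^*(e)] = 0\right)}{q_B^{(t)}}.$$

For any other user $s' \in S \setminus \{s\}$, we keep its cardinality estimate unchanged, i.e., $\hat{n}_{s'}^{(t)} \leftarrow \hat{n}_{s'}^{(t-1)}$. We easily find that $q_B^{(t)}$ can be computed incrementally. That is, we initialize $q_B^{(1)}$ to 1, and incrementally compute $q_B^{(t+1)}$ as

$$q_B^{(t+1)} \leftarrow q_B^{(t)} - \frac{\mathbf{1}\left(B^{(t-1)}[h^*(e)] = 0\right)}{M}, \qquad t \ge 1.$$

Hence, the time complexity of FreeBS for processing each user-item pair is $O(1)$.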
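The update procedure of FreeBS translates almost line by line into code. The Python sketch below is illustrative only: `hash64` is an assumed BLAKE2b-based stand-in for $h^*$, indices are 0-based, and the class keeps the integer count of zero bits instead of the floating-point value $q_B$ so that $1/q_B = M / m_0$ is computed without accumulated rounding error.

```python
import hashlib
from collections import defaultdict


def hash64(key):
    """Deterministic 64-bit hash of a key (assumed stand-in for a uniform hash)."""
    digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class FreeBS:
    """Bit array shared by all user-item pairs, with anytime estimates."""

    def __init__(self, M):
        self.M = M
        self.B = bytearray(M)            # shared bit array, all zero
        self.zeros = M                   # m_0: current number of zero bits
        self.est = defaultdict(float)    # running estimate per user

    def update(self, user, item):
        j = hash64((user, item)) % self.M        # h*(e), hashes the whole pair
        if self.B[j] == 0:
            # q_B = zeros / M is the chance that a new pair flips a bit now.
            self.est[user] += self.M / self.zeros
            self.B[j] = 1
            self.zeros -= 1
        return self.est[user]                    # O(1) work per pair


fbs = FreeBS(1 << 14)
stream = [("heavy", i) for i in range(1000)]
stream += [(("light", k), i) for k in range(500) for i in range(10)]
for s, d in stream:
    fbs.update(s, d)
heavy_estimate = fbs.est["heavy"]
```

Because a repeated pair always hashes to a bit that is already set, replaying the stream leaves every estimate unchanged, and the estimate of any user can be read at any time without scanning the array.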

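Error Analysis. Let $T_s^{(t)}$ denote the set of the first occurrence times of user-item pairs associated with user $s$ in stream $\Gamma$ up to time $t$. Formally, we define $T_s^{(t)}$ as

$$T_s^{(t)} = \left\{i : s^{(i)} = s \wedge d^{(i)} \notin N_s^{(i-1)},\ 1 \le i \le t\right\}.$$

Theorem 1

The expectation and variance of $\hat{n}_s^{(t)}$ are

$$E(\hat{n}_s^{(t)}) = n_s^{(t)}, \qquad \mathrm{Var}(\hat{n}_s^{(t)}) = \sum_{i \in T_s^{(t)}} E\left(\frac{1}{q_B^{(i)}}\right) - n_s^{(t)},$$

where, with $S(\cdot, \cdot)$ the Stirling number of the second kind and assuming that at least one bit of $B$ is still zero just before time $t$,

$$E\left(\frac{1}{q_B^{(t)}}\right) = \sum_{j=0}^{\min(n^{(t-1)},\, M-1)} \frac{M}{M - j} \cdot \frac{\binom{M}{j}\, j!\, S(n^{(t-1)}, j)}{M^{n^{(t-1)}}}.$$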

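Proof. Let $\delta_e$ denote an indicator that equals 1 when updating a user-item pair $e$ incurs a value change of $B[h^*(e)]$ (i.e., $B[h^*(e)]$ changes from 0 to 1), and 0 otherwise. Since only the first occurrence of a pair can change a bit, we easily have

$$\hat{n}_s^{(t)} = \sum_{i \in T_s^{(t)}} \frac{\delta_{e^{(i)}}}{q_B^{(i)}}.$$

For each $i \in T_s^{(t)}$, we have

$$E\left(\delta_{e^{(i)}} \mid q_B^{(i)}\right) = q_B^{(i)}, \qquad \mathrm{Var}\left(\delta_{e^{(i)}} \mid q_B^{(i)}\right) = q_B^{(i)}\left(1 - q_B^{(i)}\right).$$

Define $Q_B^{(t)} = \{q_B^{(i)} : i \in T_s^{(t)}\}$. Then, we have

$$E\left(\hat{n}_s^{(t)} \mid Q_B^{(t)}\right) = \sum_{i \in T_s^{(t)}} \frac{E\left(\delta_{e^{(i)}} \mid q_B^{(i)}\right)}{q_B^{(i)}} = |T_s^{(t)}| = n_s^{(t)},$$

and therefore $E(\hat{n}_s^{(t)}) = E\left(E(\hat{n}_s^{(t)} \mid Q_B^{(t)})\right) = n_s^{(t)}$. Given $Q_B^{(t)}$, random variables $\delta_{e^{(i)}}$, $i \in T_s^{(t)}$, are independent of each other. Then, we have

$$\mathrm{Var}\left(\hat{n}_s^{(t)} \mid Q_B^{(t)}\right) = \sum_{i \in T_s^{(t)}} \frac{\mathrm{Var}\left(\delta_{e^{(i)}} \mid q_B^{(i)}\right)}{\left(q_B^{(i)}\right)^2} = \sum_{i \in T_s^{(t)}} \frac{1}{q_B^{(i)}} - n_s^{(t)}.$$

The variance of $\hat{n}_s^{(t)}$ is computed as

$$\mathrm{Var}(\hat{n}_s^{(t)}) = E\left(\mathrm{Var}(\hat{n}_s^{(t)} \mid Q_B^{(t)})\right) + \mathrm{Var}\left(E(\hat{n}_s^{(t)} \mid Q_B^{(t)})\right).$$

Since $\mathrm{Var}\left(E(\hat{n}_s^{(t)} \mid Q_B^{(t)})\right) = 0$, using the equation $\mathrm{Var}(\hat{n}_s^{(t)} \mid Q_B^{(t)}) = \sum_{i \in T_s^{(t)}} 1/q_B^{(i)} - n_s^{(t)}$, we have

$$\mathrm{Var}(\hat{n}_s^{(t)}) = \sum_{i \in T_s^{(t)}} E\left(\frac{1}{q_B^{(i)}}\right) - n_s^{(t)}.$$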

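In what follows we derive the formula for $E(1/q_B^{(t)})$, recalling that $q_B^{(t)} = m_0^{(t-1)}/M$, where $m_0^{(t-1)}$ is the number of zero bits in $B$ just before time $t$. For $j$ specific distinct bits in $B$, there exist $j!\, S(n^{(t-1)}, j)$ ways to map the $n^{(t-1)}$ distinct user-item pairs that occurred in stream $\Gamma$ before time $t$ into these bits given that each bit has at least one user-item pair, where $S(n, j)$, the Stirling number of the second kind [2], is computed as

$$S(n, j) = \frac{1}{j!} \sum_{i=0}^{j} (-1)^{j-i} \binom{j}{i} i^{n}.$$

In addition, there exist $\binom{M}{j}$ ways to select $j$ distinct bits from $B$, therefore we have

$$P\left(m_0^{(t-1)} = M - j\right) = \frac{\binom{M}{j}\, j!\, S(n^{(t-1)}, j)}{M^{n^{(t-1)}}}.$$

Then, we have

$$E\left(\frac{1}{q_B^{(t)}}\right) = \sum_{j=0}^{\min(n^{(t-1)},\, M-1)} \frac{M}{M - j}\, P\left(m_0^{(t-1)} = M - j\right).$$

Next, we introduce a method to approximately compute $E(1/q_B^{(t)})$. We expand the function $1/q_B^{(t)}$ by its Taylor series around $E(q_B^{(t)})$ as

$$E\left(\frac{1}{q_B^{(t)}}\right) \approx E\left(\frac{1}{E(q_B^{(t)})} - \frac{q_B^{(t)} - E(q_B^{(t)})}{\left(E(q_B^{(t)})\right)^2} + \frac{\left(q_B^{(t)} - E(q_B^{(t)})\right)^2}{\left(E(q_B^{(t)})\right)^3}\right) = \frac{1}{E(q_B^{(t)})} + \frac{\mathrm{Var}(q_B^{(t)})}{\left(E(q_B^{(t)})\right)^3}.$$

From [46] (eqs. (5) and (6) in [46]), we easily have $E(q_B^{(t)}) \approx e^{-n^{(t-1)}/M}$ and $\mathrm{Var}(q_B^{(t)}) \approx \frac{1}{M} e^{-n^{(t-1)}/M}\left(1 - \left(1 + \frac{n^{(t-1)}}{M}\right) e^{-n^{(t-1)}/M}\right)$. Then, we obtain $E(1/q_B^{(t)}) \approx e^{n^{(t-1)}/M}\left(1 + \frac{1}{M}\left(e^{n^{(t-1)}/M} - \frac{n^{(t-1)}}{M} - 1\right)\right)$.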
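To get a feel for how tight the Taylor approximation of $E(1/q_B)$ is, the following Python sketch compares it with the exact expectation of $M$ divided by the number of zero bits after $n$ distinct pairs have been hashed into $M$ bits. Instead of evaluating Stirling numbers directly, the exact value is obtained from the equivalent occupancy recursion (each new pair hits an already set bit with probability $k/M$ when $k$ bits are set). The function names and the restriction $n < M$ are assumptions of this illustration.

```python
import math


def exact_inv_q(M, n):
    """Exact E[M / (number of zero bits)] after n distinct pairs; needs n < M."""
    if n >= M:
        raise ValueError("need n < M so that at least one bit stays zero")
    p = [0.0] * (n + 1)          # p[k] = P(exactly k bits are set)
    p[0] = 1.0
    for _ in range(n):
        nxt = [0.0] * (n + 1)
        for k, pk in enumerate(p):
            if pk == 0.0:
                continue
            nxt[k] += pk * k / M                 # pair lands on a set bit
            if k < n:
                nxt[k + 1] += pk * (M - k) / M   # pair sets a new bit
        p = nxt
    return sum(pk * M / (M - k) for k, pk in enumerate(p))


def taylor_inv_q(M, n):
    """Second-order Taylor approximation e^r (1 + (e^r - r - 1) / M), r = n / M."""
    r = n / M
    return math.exp(r) * (1 + (math.exp(r) - r - 1) / M)


exact = exact_inv_q(256, 200)
approx = taylor_inv_q(256, 200)
relative_gap = abs(exact - approx) / exact
```

For moderate loads $n/M$ the two values agree closely, which is why the approximation is convenient for evaluating the variance in Theorem 1 without summing over Stirling numbers.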

IV-B FreeRS: Parameter-Free Register Sharing

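Data Structure. The pseudo-code for FreeRS is shown as Algorithm 2. FreeRS consists of $M$ registers $R[1], \ldots, R[M]$, which are initialized to zero. In addition, FreeRS also uses a hash function $h^*(e)$ to randomly map each user-item pair $e = (s, d)$ to an integer in $\{1, \ldots, M\}$ and another function $\rho^*(e)$ that maps $e$ to a random integer in $\{1, 2, \ldots\}$ according to a $Geometric(1/2)$ distribution. Note that $h^*(e)$ and $\rho^*(e)$ differ from hash functions $h(d)$ and $\rho(d)$ used by vHLL, which map item $d$ to a random integer in $\{1, \ldots, m\}$ and $\{1, 2, \ldots\}$, respectively.

Update Procedure. When user-item pair $e = (s, d)$ arrives at time $t$, FreeRS first computes two random variables $h^*(e)$ and $\rho^*(e)$, and then updates $R[h^*(e)]$ as

$$R^{(t)}[h^*(e)] \leftarrow \max\left\{R^{(t-1)}[h^*(e)],\ \rho^*(e)\right\}.$$

Let $q_R^{(t)}$ denote the probability that a new user-item pair arriving at time $t$ changes the value of a register among $R[1], \ldots, R[M]$. Formally, $q_R^{(t)}$ is defined as

$$q_R^{(t)} = \sum_{j=1}^{M} P\left(h^*(e) = j \wedge \rho^*(e) > R^{(t-1)}[j]\right) = \frac{1}{M} \sum_{j=1}^{M} 2^{-R^{(t-1)}[j]}.$$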

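$R[1, \ldots, M] \leftarrow [0, \ldots, 0]$; $q_R \leftarrow 1$;
$\hat{n}_s \leftarrow 0$, $s \in S$;
foreach $e = (s, d) \in \Gamma$ do
       if $\rho^*(e) > R[h^*(e)]$ then
             $\hat{n}_s \leftarrow \hat{n}_s + 1/q_R$;
             $q_R \leftarrow q_R + \left(2^{-\rho^*(e)} - 2^{-R[h^*(e)]}\right)/M$;
             $R[h^*(e)] \leftarrow \rho^*(e)$;
       end if
end foreach
Algorithm 2 The pseudo-code for FreeRS.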

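Let $\hat{n}_s^{(t)}$ denote the cardinality estimate of user $s$ at time $t$. When user-item pair $e = (s, d)$ arrives at time $t$, we update the cardinality estimate of user $s$ as

$$\hat{n}_s^{(t)} \leftarrow \hat{n}_s^{(t-1)} + \frac{\mathbf{1}\left(R^{(t)}[h^*(e)] \ne R^{(t-1)}[h^*(e)]\right)}{q_R^{(t)}}.$$

For any other user $s' \in S \setminus \{s\}$, we keep its cardinality estimate unchanged, i.e., $\hat{n}_{s'}^{(t)} \leftarrow \hat{n}_{s'}^{(t-1)}$. Similar to $q_B^{(t)}$, we compute $q_R^{(t)}$ incrementally. In detail, we initialize $q_R^{(1)} = 1$ and incrementally compute $q_R^{(t+1)}$ as

$$q_R^{(t+1)} \leftarrow q_R^{(t)} + \frac{2^{-R^{(t)}[h^*(e)]} - 2^{-R^{(t-1)}[h^*(e)]}}{M}, \qquad t \ge 1.$$

Hence, the time complexity of FreeRS for processing each user-item pair is also $O(1)$.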
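The register sharing update can be sketched in the same style. In this Python illustration, `hash64` is an assumed stand-in hash whose upper 32 bits give the register index $h^*(e)$ (0-based) and whose lower 32 bits give $\rho^*(e)$ as the number of leading zeros plus one; this bit-splitting convention is a choice made for the example, not something prescribed by the text.

```python
import hashlib
from collections import defaultdict


def hash64(key):
    """Deterministic 64-bit hash of a key (assumed stand-in for a uniform hash)."""
    digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class FreeRS:
    """Register array shared by all user-item pairs, with anytime estimates."""

    def __init__(self, M):
        self.M = M
        self.R = [0] * M                 # shared registers, all zero
        self.q = 1.0                     # q_R = (1/M) * sum_j 2^(-R[j])
        self.est = defaultdict(float)    # running estimate per user

    def update(self, user, item):
        x = hash64((user, item))
        j = (x >> 32) % self.M                   # h*(e) from the upper 32 bits
        low = x & 0xFFFFFFFF
        rho = 32 - low.bit_length() + 1          # rho*(e): leading zeros plus one
        old = self.R[j]
        if rho > old:
            self.est[user] += 1.0 / self.q       # Horvitz-Thompson style increment
            self.q += (2.0 ** -rho - 2.0 ** -old) / self.M
            self.R[j] = rho
        return self.est[user]                    # O(1) work per pair


frs = FreeRS(4096)
stream = [("heavy", i) for i in range(20000)]
stream += [(("light", k), i) for k in range(2000) for i in range(10)]
for s, d in stream:
    frs.update(s, d)
heavy_estimate = frs.est["heavy"]
```

When $M$ is a power of two, every increment to `q` is a dyadic fraction, so the floating-point value of `q` stays exact for realistic register values; for other $M$ a periodic recomputation of `q` from the registers may be a reasonable safeguard.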

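Error Analysis. We derive the error of $\hat{n}_s^{(t)}$ as follows:

Theorem 2

The expectation and variance of $\hat{n}_s^{(t)}$ are

$$E(\hat{n}_s^{(t)}) = n_s^{(t)}, \qquad \mathrm{Var}(\hat{n}_s^{(t)}) = \sum_{i \in T_s^{(t)}} E\left(\frac{1}{q_R^{(i)}}\right) - n_s^{(t)},$$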

where

with