A Framework for Estimating Stream Expression Cardinalities
Abstract
Given $m$ distributed data streams $A_1, \dots, A_m$, we consider the problem of estimating the number of unique identifiers in streams defined by set expressions over $A_1, \dots, A_m$. We identify a broad class of algorithms for solving this problem, and show that the estimators output by any algorithm in this class are perfectly unbiased and satisfy strong variance bounds. Our analysis unifies and generalizes a variety of earlier results in the literature. To demonstrate its generality, we describe several novel sampling algorithms in our class, and show that they achieve a novel tradeoff between accuracy, space usage, update speed, and applicability.
1 Introduction
Consider an internet company that monitors the traffic flowing over its network by placing a sensor at each ingress and egress point. Because the volume of traffic is large, each sensor stores only a small sample of the observed traffic, using some simple sampling procedure. At some later point, the company decides that it wishes to estimate the number of unique users who satisfy a certain property $P$ and have communicated over its network. We refer to this as the DistinctOnSubPopulation$_P$ problem, or Distinct$_P$ for short. How can the company combine the samples computed by each sensor, in order to accurately estimate the answer to this query?
In the case that $P$ is the trivial property that is satisfied by all users, the answer to the query is simply the number of DistinctElements in the traffic stream, or Distinct for short. The problem of designing streaming algorithms and sampling procedures for estimating DistinctElements has been the subject of intense study. In general, however, $P$ may be significantly more complicated than the trivial property, and may not be known until query time. For example, the company may want to estimate the number of (unique) men in a certain age range, from a specified country, who accessed a certain set of websites during a designated time period, while excluding IP addresses belonging to a designated blacklist. This more general setting, where $P$ is a nontrivial ad hoc property, has received somewhat less attention than the basic Distinct problem.
In this paper, our goal is to identify a simple method for combining the samples from each sensor, so that the following holds. As long as each sensor is using a sampling procedure that satisfies a certain mild technical condition, then for any property $P$, the combining procedure outputs an estimate for the Distinct$_P$ problem that is unbiased. Moreover, its variance should be bounded by that of the individual sensors’ sampling procedures. (Footnote 1: More precisely, we are interested in showing that the variance of the returned estimate is at most that of the (hypothetical) estimator obtained by running each individual sensor’s sampling algorithm on the concatenated stream $A_1 \circ \dots \circ A_m$. We refer to the latter estimator as “hypothetical” because it is typically infeasible to materialize the concatenated stream in distributed environments.)
For reasons that will become clear later, we refer to our proposed combining procedure as the ThetaSketch Framework, and we refer to the mild technical condition that each sampling procedure must satisfy to guarantee unbiasedness as Goodness. If the sampling procedures satisfy an additional property that we refer to as monotonicity, then the variance of the estimate output by the combining procedure is guaranteed to satisfy the desired variance bound. The ThetaSketch Framework, and our analysis of it, unifies and generalizes a variety of results in the literature (see Section 2.5 for details).
The Importance of Generality. As we will see, there is a huge array of sampling procedures that the sensors could use. Each procedure comes with a unique tradeoff between accuracy, space requirements, update speed, and simplicity. Moreover, some of these procedures come with additional desirable properties, while others do not. We would like to support as many sampling procedures as possible, because the best one to use in any given setting will depend on the relative importance of each resource in that setting.
Handling Set Expressions. The scenario described above can be modeled as follows. Each sensor $j$ observes a stream $A_j$ of identifiers from a data universe of size $n$, and the goal is to estimate the number of distinct identifiers that satisfy property $P$ in the combined stream $U = \cup_j A_j$. In full generality, we may wish to handle more complicated set expressions applied to the constituent streams, other than set-union. For example, we may have $m$ streams of identifiers $A_1, \dots, A_m$, and wish to estimate the number of distinct identifiers satisfying property $P$ that appear in all streams. The ThetaSketch Framework can be naturally extended to provide estimates for such queries. Our analysis applies to any sequence of set operations on the $A_j$’s, but we restrict our attention to set-union and set-intersection throughout the paper for simplicity.
2 Preliminaries, Background, and Contributions
2.1 Notation and Assumptions
Streams and Set Operations. Throughout, $A$ denotes a stream of identifiers from a data universe $[n] := \{1, \dots, n\}$. We view any property $P$ on identifiers as a subset of $[n]$, and let $n_{P,A}$ denote the number of distinct identifiers that appear in $A$ and satisfy $P$. For brevity, we let $n_A$ denote $n_{[n],A}$. When working in a multi-stream setting, $A_1, \dots, A_m$ denote $m$ streams of identifiers from $[n]$, $U := \cup_{j=1}^m A_j$ will denote the concatenation of the $m$ input streams, while $I := \cap_{j=1}^m A_j$ denotes the set of identifiers that appear at least once in all $m$ streams. Because we are interested only in distinct counts, it does not matter for definitional purposes whether we view $U$ and $I$ as sets, or as multisets. For any property $P$, $n_{P,U}$ and $n_{P,I}$ denote the corresponding distinct counts over $U$ and $I$, while $n_U := n_{[n],U}$ and $n_I := n_{[n],I}$.
Hash Functions. For simplicity and clarity, and following prior work (e.g. [5, 9]), we assume throughout that the sketching and sampling algorithms make use of a perfectly random hash function $h$ mapping the data universe $[n]$ to the open interval $(0,1)$. That is, for each $x \in [n]$, $h(x)$ is a uniform random number in $(0,1)$. Given a subset $S$ of hash values computed from a stream $A$, and a property $P$, $P(S)$ denotes the subset of hash values in $S$ whose corresponding identifiers in $[n]$ satisfy $P$. Finally, given a stream $A$, the notation $X^{n_A}$ refers to the set of hash values obtained by mapping a hash function $h$ over the $n_A$ distinct identifiers in $A$.
2.2 Prior Art: Sketching Procedures for Distinct Queries
There is a sizeable literature on streaming algorithms for estimating the number of distinct elements in a single data stream. Some, but not all, of these algorithms can be modified to solve the Distinct$_P$ problem for general properties $P$. Depending on which functionality is required, systems based on HyperLogLog Sketches, K’th Minimum Value (KMV) Sketches, and Adaptive Sampling represent the state of the art for practical systems [21]. (Footnote 2: Algorithms with better asymptotic bit-complexity are known [23], but they do not match the practical performance of the algorithms discussed here. See Section 6.3.) For clarity of exposition, we defer a thorough overview of these algorithms to Section 6. Here, we briefly review the main concepts and relevant properties of each.
HLL: HyperLogLog Sketches. HLL is a sketching algorithm for the vanilla Distinct problem. Its accuracy per bit is superior to the KMV and Adaptive Sampling algorithms described below. However, unlike KMV and Adaptive Sampling, it is not known how to extend the HLL sketch to estimate $n_{P,A}$ for general properties $P$ (unless, of course, $P$ is known prior to stream processing).
KMV: K’th Minimum Value Sketches. The KMV sketching procedure for estimating Distinct works as follows. While processing an input stream $A$, KMV keeps track of the set $S$ of the $k$ smallest unique hashed values of stream elements. The update time of a heap-based implementation of KMV is $O(\log k)$. The KMV estimator for Distinct is $k/m_{k+1}$, where $m_{k+1}$ denotes the $(k+1)$st smallest unique hash value. (Footnote 3: Some works use a slightly different estimate, e.g. [4]. We use $k/m_{k+1}$ because it is unbiased, and for consistency with the work of Cohen and Kaplan [9] described below.) It has been proved by [5], [19], and others, that this estimator has expectation $n_A$ and variance $(n_A^2 - k\,n_A)/(k-1) < n_A^2/(k-1)$. Duffield et al. [11] proposed to change the heap-based implementation of the KMV sketching algorithm to an implementation based on quickselect [22]. This reduces the sketch update cost from $O(\log k)$ to amortized $O(1)$. However, this hides a larger constant than competing methods. At the cost of storing the sampled identifiers, and not just their hash values, the KMV sketching procedure can be extended to estimate $n_{P,A}$ for any property $P$ (Section 6 has details).
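The heap-based KMV procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the estimator $k/m_{k+1}$ (with $m_{k+1}$ the $(k+1)$st smallest distinct hash value), uses a SHA-256-based stand-in (`hash01`) for the perfectly random hash function, and returns the exact count when fewer than $k+1$ distinct values have been seen. The names `KMVSketch` and `hash01` are hypothetical.

```python
import hashlib
import heapq


def hash01(item):
    """Stand-in for a random hash into the open interval (0, 1) (an assumption)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 2**52 + 0.5) / 2**52


class KMVSketch:
    """Keeps the k+1 smallest distinct hash values in a max-heap."""

    def __init__(self, k, hash_fn=hash01):
        self.k = k
        self.hash_fn = hash_fn
        self._heap = []        # negated values, so heap[0] is the largest kept value
        self._members = set()  # values currently held, for duplicate detection

    def update(self, item):
        x = self.hash_fn(item)
        if x in self._members:
            return
        if len(self._heap) < self.k + 1:
            heapq.heappush(self._heap, -x)
            self._members.add(x)
        elif x < -self._heap[0]:
            # Insert x and evict the current largest value in O(log k).
            evicted = -heapq.heappushpop(self._heap, -x)
            self._members.discard(evicted)
            self._members.add(x)

    def estimate(self):
        if len(self._heap) <= self.k:
            # Assumed convention: with at most k distinct values the count is exact.
            return float(len(self._heap))
        return self.k / -self._heap[0]  # k / m_{k+1}
```

Feeding the same identifier twice leaves the sketch unchanged, which is why the estimate depends only on the distinct identifiers in the stream.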
Adaptive Sampling. Adaptive Sampling maintains a sampling level $i \ge 0$, and the set $S$ of all hash values less than $2^{-i}$; whenever $|S|$ exceeds a prespecified size limit, $i$ is incremented and $S$ is scanned, discarding any hash value that is now too big. Because a simple scan is cheaper than running quickselect, an implementation of this scheme is typically faster than KMV. The estimator of $n_A$ is $|S|/2^{-i}$. It has been proved by [13] that this estimator is unbiased, and that its variance is approximately $1.44\,n_A^2/(k-1)$, where the approximation sign hides oscillations caused by the periodic culling of $S$. Like KMV, Adaptive Sampling can be extended to estimate $n_{P,A}$ for any property $P$. Although the stream processing speed of Adaptive Sampling is excellent, the fact that its accuracy oscillates as $n_A$ increases is a shortcoming.
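A minimal sketch of Adaptive Sampling as described above, assuming that the threshold at sampling level $i$ is $2^{-i}$ and that the estimate is $|S| \cdot 2^{i}$. The class name `AdaptiveSample`, the `limit` parameter, and the SHA-256-based `hash01` stand-in for a random hash function are illustrative assumptions.

```python
import hashlib


def hash01(item):
    """Stand-in for a random hash into the open interval (0, 1) (an assumption)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 2**52 + 0.5) / 2**52


class AdaptiveSample:
    def __init__(self, limit, hash_fn=hash01):
        self.limit = limit      # prespecified size limit on the sample
        self.hash_fn = hash_fn
        self.level = 0          # sampling level i; the threshold is 2**-i
        self.sample = set()     # all distinct hash values below the threshold

    def update(self, item):
        x = self.hash_fn(item)
        if x < 2.0 ** -self.level:
            self.sample.add(x)
            while len(self.sample) > self.limit:
                # Halve the threshold and scan, discarding values now too big.
                self.level += 1
                theta = 2.0 ** -self.level
                self.sample = {v for v in self.sample if v < theta}

    def estimate(self):
        return len(self.sample) * 2.0 ** self.level  # |S| / 2**-i
```

Each cull is a single linear scan of the sample, which is the source of the speed advantage over quickselect noted in the text; the price is that the sample size, and hence the accuracy, jumps at every cull.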
HLL for set operations on streams. HLL can be directly adapted to handle set-union (see Section 6 for details). For set-intersection, the relevant adaptation uses the inclusion/exclusion principle. However, the variance of this estimate is approximately a factor of $n_U/n_I$ worse than the variance achieved by the multiKMV algorithm described below. When $n_I \ll n_U$, this penalty factor overwhelms HLL’s fundamentally good accuracy per bit.
KMV for set operations on streams. Given streams $A_1, \dots, A_m$, let $S_j$ denote the KMV sketch computed from stream $A_j$. A trivial way to use these sketches to estimate the number of distinct items $n_U$ in the union stream $U$ is to let $M'_U$ denote the $(k+1)$st smallest value in the union of the sketches, and let $S'_U = \{x \in \cup_j S_j : x < M'_U\}$. Then $S'_U$ is identical to the sketch that would have been obtained by running KMV directly on the concatenated stream $A_1 \circ \dots \circ A_m$, and hence $k/M'_U$ is an unbiased estimator for $n_U$, by the same analysis as in the single-stream setting. We refer to this procedure as the “non-growing union rule.”
Intuitively, the non-growing union rule does not use all of the information available to it. The sets $S_j$ contain up to $km$ distinct samples in total, but the non-growing union rule ignores all but the $k$ smallest samples. With this in mind, Cohen and Kaplan [9] proposed the following adaptation of KMV to handle unions of multiple streams. We denote their algorithm by multiKMV, and also refer to it as the “growing union rule”.
For each KMV sketch $S_j$ computed from stream $A_j$, let $M_j$ denote that sketch’s value of $m_{k+1}$. Define $M_U = \min_j M_j$, and $S_U = \{x \in \cup_j S_j : x < M_U\}$. Then $n_U$ is estimated by $|S_U|/M_U$, and $n_{P,U}$ is estimated by $|P(S_U)|/M_U$.
At first glance, it may seem obvious that the growing union rule yields an estimator that is “at least as good” as the non-growing union, since the growing union rule makes use of at least as many samples as the non-growing rule. However, it is by no means trivial to prove that multiKMV is unbiased, nor that its variance is dominated by that of the non-growing union rule. Nonetheless, [9] managed to prove this: they showed that the multiKMV estimate of $n_{P,U}$ is unbiased and has variance that is dominated by the variance of the non-growing-union KMV estimate of $n_{P,U}$:
$$\sigma^2(\mathrm{multiKMV}_{P,U}) \le \sigma^2(\mathrm{KMV}_{P,U}) \qquad (1)$$
As observed in [9], multiKMV can be adapted in a similar manner to handle setintersections (see Section 3.8 for details).
Adaptive Sampling for set operations on streams. Adaptive Sampling can handle set unions and intersections with a similar “growing union rule” in which the combined threshold is the minimum of the per-sketch thresholds $\theta_j$. Here, $\theta_j$ denotes the threshold for discarding hash values that was computed by the $j$th Adaptive Sampling sketch. We refer to this algorithm as multiAdapt. [18] proved epsilon-delta bounds on the error of multiAdapt, but did not derive expressions for mean or variance. However, multiAdapt and multiKMV are both special cases of our ThetaSketch Framework, and in Section 3 we will prove (apparently for the first time) that multiAdapt is unbiased, and satisfies strong variance bounds. These results have the following two advantages over the epsilon-delta bounds of [18]. First, proving unbiasedness is crucial for obtaining estimators for distinct counts over subpopulations: these estimators are analyzed as a sum of a huge number of per-item estimates (see Theorem 3.10 for details), and biases add up. Second, variance bounds enable derivation of confidence intervals that an epsilon-delta guarantee cannot provide, unless the guarantee holds for many values of delta simultaneously.
2.3 Overview of the ThetaSketch Framework
In this overview, we describe the ThetaSketch Framework in the multi-stream setting where the goal is to output an estimate of $n_{P,U}$, where $U = \cup_{j=1}^m A_j$ (we define the framework formally in Section 2.4). That is, the goal is to identify a very large class of sampling algorithms that can run on each constituent stream $A_j$, as well as a “universal” method for combining the samples from each $A_j$ to obtain a good estimator for $n_{P,U}$. We clarify that the ThetaSketch Framework, and our analysis of it, yields unbiased estimators that are interesting even in the single-stream case, where $m = 1$.
We begin by noting the striking similarities between the multiKMV and multiAdapt algorithms outlined in Section 2.2. In both cases, a sketch can be viewed as a pair $(\theta, S)$ where $\theta$ is a certain threshold that depends on the stream, and $S$ is a set of hash values which are all strictly less than $\theta$. In this view, both schemes use the same estimator $|S|/\theta$, and also the same growing union rule for combining samples from multiple streams. The only difference lies in their respective rules for mapping streams to thresholds $\theta$. The ThetaSketch Framework formalizes this pattern of similarities and differences.
The assumed form of the single-stream sampling algorithms. The ThetaSketch Framework demands that each constituent stream $A_j$ be processed by a sampling algorithm of the following form. While processing $A_j$, the algorithm evaluates a “threshold choosing function” (TCF) $T^{(j)}(A_j)$. The final state of the algorithm must be of the form $(\theta_j, S_j)$, where $\theta_j = T^{(j)}(A_j)$ and $S_j$ is the set of all hash values strictly less than $\theta_j$ that were observed while processing $A_j$. If we want to estimate $n_{P,U}$ for nontrivial properties $P$, then the algorithm must also store the corresponding identifier that hashed to each value in $S_j$. Note that the framework itself does not specify the threshold-choosing functions $T^{(j)}$. Rather, any specification of the TCFs $T^{(j)}$ defines a particular instantiation of the framework.
Remark. It might appear from Algorithm 1 that for any TCF $T$, the base sampling function makes two passes over the input stream: one to compute $\theta$, and another to compute $S$. However, in all of the instantiations we consider, both operations can be performed in a single pass.
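The assumed form of the base sampling algorithm can be rendered literally as two passes. The sketch below is illustrative only: the function names `samp` and `kmv_tcf`, the SHA-256-based `hash01` stand-in for a random hash function, and the convention that the threshold is 1 when the stream has at most $k$ distinct identifiers are all assumptions rather than the paper's Algorithm 1.

```python
import hashlib


def hash01(ident):
    """Stand-in for a random hash into the open interval (0, 1) (an assumption)."""
    digest = hashlib.sha256(repr(ident).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 2**52 + 0.5) / 2**52


def kmv_tcf(k, stream, h):
    """Example TCF: the (k+1)st smallest distinct hash value (1.0 if none exists)."""
    distinct = sorted({h(ident) for ident in stream})
    return distinct[k] if len(distinct) > k else 1.0


def samp(tcf, k, stream, h):
    """Return a theta-sketch (theta, S); S maps each kept hash to its identifier."""
    theta = tcf(k, stream, h)       # pass 1: choose the threshold
    sample = {}
    for ident in stream:            # pass 2: keep hashes strictly below theta
        x = h(ident)
        if x < theta:
            sample[x] = ident
    return theta, sample
```

Storing identifiers alongside hash values is what later allows an arbitrary property to be evaluated on the sample at query time. Any other function with the same signature as `kmv_tcf` can be passed to `samp`, which mirrors the way the framework leaves the TCF unspecified.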
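The universal combining rule. Given the states $(\theta_j, S_j)$ of each of the $m$ sampling algorithms when run on the streams $A_1, \dots, A_m$, define $\theta_U = \min_j \theta_j$, and $S_U = \{x \in \cup_j S_j : x < \theta_U\}$ (see the combining function in Algorithm 1). Then $n_U$ is estimated by $|S_U|/\theta_U$, and $n_{P,U}$ by $|P(S_U)|/\theta_U$ (see the estimation function in Algorithm 1).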
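The combining rule amounts to a minimum over thresholds followed by a filter. The following sketch assumes each theta-sketch is a pair `(theta, sample)` with `sample` a dictionary from hash value to identifier; the function names `theta_union` and `estimate_on_subpopulation` are hypothetical labels for the two functions of Algorithm 1.

```python
def theta_union(sketches):
    """Growing union rule: take the minimum threshold, keep hashes strictly below it."""
    theta_u = min(theta for theta, _ in sketches)
    sample_u = {
        x: ident
        for _, sample in sketches
        for x, ident in sample.items()
        if x < theta_u
    }
    return theta_u, sample_u


def estimate_on_subpopulation(sketch, prop):
    """Estimate the number of distinct identifiers that satisfy prop."""
    theta, sample = sketch
    return sum(1 for ident in sample.values() if prop(ident)) / theta


# Hand-made sketches of two streams (hash values chosen for illustration).
sketch_a = (0.5, {0.1: "a", 0.3: "b", 0.4: "c"})
sketch_b = (0.25, {0.1: "a", 0.2: "d"})
union = theta_union([sketch_a, sketch_b])
print(union)                                                          # (0.25, {0.1: 'a', 0.2: 'd'})
print(estimate_on_subpopulation(union, lambda ident: True))           # 8.0
print(estimate_on_subpopulation(union, lambda ident: ident in "ab"))  # 4.0
```

Note that the identifiers "b" and "c" are dropped from the combined sample even though they were sampled by the first sketch: their hash values are not below the combined threshold, so keeping them would break the invariant that the sample contains exactly the hash values below the threshold.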
The analysis. Our analysis shows that, so long as each threshold-choosing function $T^{(j)}$ satisfies a mild technical condition that we call 1-Goodness (Goodness for short), then the resulting estimate of $n_{P,U}$ is unbiased. We also show that if each $T^{(j)}$ satisfies a certain additional condition that we call monotonicity, then the estimate satisfies strong variance bounds (analogous to the bound of Equation (1) for multiKMV). Our analysis is arguably surprising, because Goodness does not imply certain properties that have traditionally been considered important, such as permutation invariance, or $S$ being a uniform random sample of the hashed unique items of the input stream.
Applicability. To demonstrate the generality of our analysis, we identify several valid instantiations of the ThetaSketch Framework. First, we show that the TCF’s used in KMV and Adaptive Sampling both satisfy Goodness and monotonicity, implying that multiKMV and multiAdapt are both unbiased and satisfy the aforementioned variance bounds. For multiKMV, this is a reproof of Cohen and Kaplan’s results [9], but for multiAdapt the results are new. Second, we identify a variant of KMV that we call pKMV, which is useful in multi-stream settings where the lengths of constituent streams are highly skewed. We show that pKMV satisfies both Goodness and monotonicity. Third, we introduce a new sampling procedure that we call the Alpha Algorithm. Unlike earlier algorithms, the Alpha Algorithm’s final state actually depends on the stream order, yet we show that it satisfies Goodness, and hence is unbiased in both the single- and multi-stream settings. We also establish variance bounds on the Alpha Algorithm in the single-stream setting. We show experimentally that the Alpha Algorithm, in both the single- and multi-stream settings, achieves a novel tradeoff between accuracy, space usage, update speed, and applicability.
Unlike KMV and Adaptive Sampling, the Alpha Algorithm does not satisfy monotonicity in general. In fact, we have identified contrived examples in the multi-stream setting on which the aforementioned variance bounds are (weakly) violated. The Alpha Algorithm does, however, satisfy monotonicity under the promise that the $A_j$ are pairwise disjoint, implying variance bounds in this case. Our experiments suggest that, in practice, the normalized variance in the multi-stream setting is not much larger than in the pairwise disjoint case.
Deployment of Algorithms.
Within Yahoo, the pKMV and Alpha algorithms are used widely. In particular, stream cardinalities in Yahoo empirically satisfy a power law, with some very large streams and many short ones, and pKMV is an attractive option for such settings. We have released an optimized opensource implementation of our algorithms at http://datasketches.github.io/.
2.4 Formal Definition of ThetaSketch Framework
The ThetaSketch Framework is defined as follows. This definition is specific to the multi-stream setting where the goal is to output an estimate of $n_{P,U}$, where $U = \cup_{j=1}^m A_j$ is the union of constituent streams $A_1, \dots, A_m$.
Definition 2.1.
The ThetaSketch Framework consists of the following components:

The data type $(\theta, S)$, where $0 < \theta \le 1$ is a threshold, and $S$ is the set of all unique hashed stream items that are less than $\theta$. We will generically use the term “theta-sketch” to refer to an instance of this data type.

The universal “combining function”, defined in Algorithm 1, that takes as input a collection of theta-sketches (purportedly obtained by running the base sampling procedure on constituent streams $A_1, \dots, A_m$), and returns a single theta-sketch (purportedly of the union stream $U = \cup_j A_j$).

The estimation function, defined in Algorithm 1, that takes as input a theta-sketch $(\theta, S)$ (purportedly obtained from some stream $A$) and a property $P$, and returns an estimate of $n_{P,A}$.
Any instantiation of the ThetaSketch Framework must specify a “threshold choosing function” (TCF), denoted $T(k, A, h)$, that maps a target sketch size, a stream, and a hash function to a threshold $\theta$. Any TCF $T$ implies a “base” sampling procedure that maps a target size, a stream $A$, and a hash function to a theta-sketch using the pseudocode shown in Algorithm 1. One can obtain an estimate for $n_{P,A}$ by feeding the resulting theta-sketch into the estimation function.
Given constituent streams $A_1, \dots, A_m$, the instantiation obtains an estimate of $n_{P,U}$ by running the base sampling procedure on each constituent stream $A_j$, feeding the resulting theta-sketches to the combining function to obtain a “combined” theta-sketch for $U = \cup_j A_j$, and then running the estimation function on this combined sketch.
Remark. Definition 2.1 assumes for simplicity that the same TCF $T$ is used in the base sampling algorithms run on each of the constituent streams. However, all of our results that depend only on Goodness (e.g. unbiasedness of estimates and non-correlation of “per-item estimates”) hold even if different Good TCF’s are used on each stream, and even if different values of $k$ are employed.
2.5 Summary of Contributions
In summary, our contributions are: (1) Formulating the ThetaSketch Framework. (2) Identifying a mild technical condition (Goodness) on TCF’s ensuring that the framework’s estimators are unbiased. If each TCF also satisfies a monotonicity condition, the framework’s estimators come with strong variance bounds analogous to Equation (1). (3) Proving that multiKMV, multiAdapt, and pKMV all satisfy Goodness and monotonicity, implying unbiasedness and variance bounds for each. (4) Introducing the Alpha Algorithm, proving that it is unbiased, and establishing quantitative bounds on its variance in the single-stream setting. (5) Experimental results showing that the Alpha Algorithm instantiation achieves a novel tradeoff between accuracy, space usage, update speed, and applicability.
3 Analysis of the ThetaSketch Framework
Section Outline. Section 3.1 shows that KMV and Adaptive Sampling are both instantiations of the ThetaSketch Framework. Section 3.2 defines Goodness. Sections 3.3 and 3.4 prove that the TCF’s that instantiate behavior identical to multiKMV and multiAdapt both satisfy Goodness. Section 3.5 proves that if a framework instantiation’s TCF satisfies Goodness, then so does the TCF that is implicitly applied to the union stream via the composition of the instantiation’s base algorithm and the combining function. Section 3.6 proves that the estimator for $n_{P,A}$ returned by the estimation function is unbiased when applied to any theta-sketch produced by a TCF satisfying Goodness. Section 3.7 defines monotonicity and shows that Goodness and monotonicity together imply variance bounds on the estimate of $n_{P,U}$. Section 3.8 explains how to tweak the ThetaSketch Framework to handle set intersections and other set operations on streams. Finally, Section 3.9 describes the pKMV variant of KMV.
3.1 Example Instantiations
Define $m_{k+1}$ to be the $(k+1)$st smallest unique hash value in $X^{n_A}$ (the hashed version of the input stream). The following is an easy observation.
Observation 3.1.
When the ThetaSketch Framework is instantiated with the TCF $T(k, A, h) = m_{k+1}$, the resulting instantiation is equivalent to the multiKMV algorithm outlined in Section 2.2.
Let $\beta$ be any real value in $(0,1)$. For any $z$, define $\beta^{i(z)}$ to be the largest value of $\beta^i$ (with $i$ a non-negative integer) that is less than $z$.
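To make the definition above concrete, the following small helper computes the largest power $\beta^i$ (with $i$ a non-negative integer) that is strictly less than a given $z$. The function name `largest_power_below` is hypothetical, and the choice $\beta = 1/2$ in the example is for illustration only.

```python
def largest_power_below(beta, z):
    """Return the largest beta**i, with i a non-negative integer, that is < z."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie in the open interval (0, 1)")
    if z <= 0.0:
        raise ValueError("z must be positive")
    value = 1.0            # beta**0
    while value >= z:      # step down until strictly below z
        value *= beta
    return value


print(largest_power_below(0.5, 0.3))   # 0.25
print(largest_power_below(0.5, 0.25))  # 0.125 (the inequality is strict)
print(largest_power_below(0.5, 1.0))   # 0.5
```

With $\beta = 1/2$ the possible thresholds are the dyadic values $1/2, 1/4, 1/8, \dots$, so applying this map to a hash value yields a threshold of the halving kind used by Adaptive Sampling.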
3.2 Definition of Goodness
The following circularity is a main source of technical difficulty in analyzing theta sketches: for any given identifier $\ell$ in a stream $A$, whether its hashed value $x_\ell = h(\ell)$ will end up in a sketch’s sample set $S$ depends on a comparison of $x_\ell$ versus a threshold $T(X^{n_A})$ that depends on $x_\ell$ itself. Adapting a technique from [9], we partially break this circularity by analyzing the following infinite family of projections of a given threshold choosing function $T(X^{n_A})$.
Definition 3.3 (Definition of FixAllButOne Projection).
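Let $T$ be a threshold choosing function. Let $\ell$ be one of the $n_A$ unique identifiers in a stream $A$. Let $X_{-\ell}^{n_A}$ be a fixed assignment of hash values to all unique identifiers in $A$ except for $\ell$. Then the fix-all-but-one projection $f_\ell[X_{-\ell}^{n_A}](x_\ell)$ of $T$ is the function that maps values of $x_\ell$ to theta-sketch thresholds via the definition $f_\ell[X_{-\ell}^{n_A}](x_\ell) = T(X^{n_A})$, where $X^{n_A}$ is the obvious combination of $X_{-\ell}^{n_A}$ and $x_\ell$.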
[9] analyzed similar projections under the assumption that the base algorithm is specifically (a weighted version of) KMV; we will instead impose the weaker condition that every fix-all-but-one projection satisfies Goodness, defined below. (Footnote 5: We chose the name 1-Goodness due to the reference to Fix-All-But-One Projections.)
Definition 3.4 (Definition of Goodness for Univariate Functions).
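A function $f(x)$ on $(0,1)$ satisfies Goodness iff there exists a fixed threshold $F$ such that:
$$\text{if } x < F, \text{ then } f(x) = F; \qquad (2)$$
$$\text{if } x \ge F, \text{ then } f(x) \le x. \qquad (3)$$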
Figure 1 contains six examples of hypothetical projections of TCF’s. Four of them satisfy Goodness; the other two do not.
Condition 3.5 (Definition of Goodness for TCF’s).
A TCF $T$ satisfies Goodness iff for every stream $A$ containing $n_A$ unique identifiers, every label $\ell \in A$, and every fixed assignment $X_{-\ell}^{n_A}$ of hash values to the identifiers in $A \setminus \{\ell\}$, the fix-all-but-one projection $f_\ell[X_{-\ell}^{n_A}](x_\ell)$ satisfies Definition 3.4.
3.3 TCF of multiKMV Satisfies Goodness
The following theorem shows that the TCF used in KMV satisfies Goodness.
Theorem 3.6.
If $T(X^{n_A}) = m_{k+1}$, then every fix-all-but-one projection of $T$ satisfies Goodness.
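The claim can be checked numerically on an example. The sketch below reads Goodness as: there is a fixed $F$ with $f(x) = F$ for $x < F$ and $f(x) \le x$ for $x \ge F$ (a reading inferred from how Subconditions (2) and (3) are used in the proof of Theorem 3.9), and it takes the KMV threshold to be the $(k+1)$st smallest distinct hash value. Under these assumptions, the candidate threshold for a KMV projection is the $k$th smallest of the fixed hash values. All function names are hypothetical.

```python
import random


def kmv_threshold(k, hashes):
    """The (k+1)st smallest distinct value among the given hash values."""
    return sorted(set(hashes))[k]


def projection(k, others, x):
    """Fix-all-but-one projection: the threshold as a function of the free hash x."""
    return kmv_threshold(k, others + [x])


def satisfies_goodness(f, F, grid):
    """Check the two subconditions for candidate threshold F on a grid of x values."""
    for x in grid:
        if x < F and f(x) != F:
            return False
        if x >= F and f(x) > x:
            return False
    return True


rng = random.Random(7)
k = 4
others = [rng.random() for _ in range(20)]   # fixed hashes of the other identifiers
F = sorted(others)[k - 1]                    # k-th smallest fixed hash value
grid = [(i + 0.5) / 1000 for i in range(1000)]
ok = satisfies_goodness(lambda x: projection(k, others, x), F, grid)
print(ok)  # True
```

The reason the check passes: if the free hash value falls below the $k$th smallest fixed value, it pushes that value into position $k+1$, so the threshold equals $F$; otherwise the $(k+1)$st smallest value is the smaller of the free value and the next fixed value, which is at most the free value.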
3.4 TCF of multiAdapt Satisfies Goodness
The following theorem shows that the TCF used in Adaptive Sampling satisfies Goodness.
Theorem 3.7.
If $T(X^{n_A}) = \beta^{i(m_{k+1})}$, then every fix-all-but-one projection of $T$ satisfies Goodness.
3.5 Goodness Is Preserved by the Combining Function
Next, we show that if a framework instantiation’s TCF satisfies Goodness, then so does the TCF that is implicitly being used by the theta-sketch construction algorithm defined by the composition of the instantiation’s base sampling algorithms and the combining function. We begin by formally extending the definition of a fix-all-but-one projection to cover the degenerate case where the label $\ell$ isn’t actually a member of the given stream $A$.
Definition 3.8.
Let $A$ be a stream containing $n_A$ identifiers. Let $\ell$ be a label that is not a member of $A$. Let the notation $X^{n_A}$ refer to an assignment of hash values to all identifiers in $A$. For any hash value $x_\ell$ of the non-member label $\ell$, define the value of the “fix-all-but-one” projection $f_\ell[X^{n_A}](x_\ell)$ to be the constant $T(X^{n_A})$.
Theorem 3.9.
If the threshold choosing functions $T^{(j)}$ of the base algorithms used to create sketches of $m$ streams $A_j$ all satisfy Condition 3.5, then so does the TCF:
$$T^U(X^{n_U}) = \min_j \{ T^{(j)}(X^{n_{A_j}}) \} \qquad (4)$$
that is implicitly applied to the union stream via the composition of those base algorithms and the combining procedure.
Proof.
Let $f^U$ be any specific fix-all-but-one projection of the threshold choosing function $T^U$ defined by Equation (4). We will exhibit the fixed value $F^U$ that causes (2) and (3) to be true for $f^U$.
The projection $f^U$ is specified by a label $\ell \in U$, and a set $X^U$ of fixed hash values for the identifiers in $U \setminus \{\ell\}$. For each $j$, those fixed hash values $X^U$ induce a set $X^j$ of fixed hash values for the identifiers in $A_j \setminus \{\ell\}$. The combination of $\ell$ and $X^j$ then specifies a projection $f^j$ of $T^{(j)}$. Now, if $\ell \in A_j$, this is a fix-all-but-one projection according to the original Definition 3.3, and according to the current theorem’s precondition, this projection must satisfy Goodness for univariate functions. On the other hand, if $\ell \notin A_j$, this is a fix-all-but-one projection according to the extended Definition 3.8, and is therefore a constant function, and therefore satisfies Goodness. Because the projection $f^j$ satisfies Goodness either way, there must exist a fixed value $F^j$ such that Subconditions (2) and (3) are true for $f^j$.
We now show that the value $F^U := \min_j F^j$ causes Subconditions (2) and (3) to be true for the projection $f^U$, thus proving that this projection satisfies Goodness.
To show: $x < F^U$ implies $f^U(x) = F^U$. The condition $x < F^U$ implies that for all $j$, $x < F^j$. Then, for all $j$, $f^j(x) = F^j$ by Subcondition (2) for the various $f^j$. Therefore, $F^U = \min_j F^j = \min_j f^j(x) = f^U(x)$, where the last step is by Eqn (4). This establishes Subcondition (2) for the projection $f^U$.
To show: $x \ge F^U$ implies $f^U(x) \le x$. Because $x \ge F^U = \min_j F^j$, there exists a $j$ such that $x \ge F^j$. By Subcondition (3) for this $f^j$, we have $f^j(x) \le x$. By Eqn (4), we then have $f^U(x) \le f^j(x) \le x$, thus establishing Subcondition (3) for $f^U$.
Finally, because the above argument applies to every projection $f^U$ of $T^U$, we have proved the desired result that $T^U$ satisfies Condition 3.5. ∎
3.6 Unbiasedness of the Estimation Function
We now show that Goodness of a TCF implies that the corresponding instantiation of the ThetaSketch Framework provides unbiased estimates of the number of unique identifiers on a stream or on the union of multiple streams.
Theorem 3.10.
Let $A$ be a stream containing $n_A$ unique identifiers, and let $P$ be a property evaluating to 1 on an arbitrary subset of the identifiers. Let $h$ denote a random hash function. Let $T$ be a threshold choosing function that satisfies Condition 3.5. Let $(\theta, S)$ denote a sketch of $A$ created by the base sampling procedure with TCF $T$, and as usual let $P(S)$ denote the subset of hash values in $S$ whose corresponding identifiers satisfy $P$. Then $E(|P(S)|/\theta) = n_{P,A}$.
Theorems 3.9 and 3.10 together imply that, in the multi-stream setting, the estimate for $n_{P,U}$ output by the ThetaSketch Framework is unbiased, assuming the base sampling schemes each use a TCF satisfying Goodness.
Proof.
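Let $A$ be a stream, and let $T$ be a Threshold Choosing Function that satisfies Goodness. Fix any $\ell \in A$. For any assignment $X^{n_A}$ of hash values to identifiers in $A$, define the “per-identifier estimate” $V_\ell$ as follows:
$$V_\ell(X^{n_A}) = \begin{cases} 1/T(X^{n_A}) & \text{if } x_\ell < T(X^{n_A}), \\ 0 & \text{otherwise.} \end{cases} \qquad (5)$$
Because $T$ satisfies Goodness, there exists a fixed threshold $F = F(X_{-\ell}^{n_A})$ for which it is a straightforward exercise to verify that:
$$V_\ell(X^{n_A}) = \begin{cases} 1/F & \text{if } x_\ell < F, \\ 0 & \text{otherwise.} \end{cases} \qquad (6)$$
Now, conditioning on $X_{-\ell}^{n_A}$ and taking the expectation with respect to $x_\ell$:
$$E(V_\ell \mid X_{-\ell}^{n_A}) = \int_0^1 V_\ell(X^{n_A})\, dx_\ell = F \cdot \frac{1}{F} + (1 - F) \cdot 0 = 1. \qquad (7)$$
Since Equation (7) establishes that $E(V_\ell) = 1$ when conditioned on each $X_{-\ell}^{n_A}$, we also have $E(V_\ell) = 1$ when the expectation is taken over all $X^{n_A}$. By linearity of expectation, we conclude that $E(|P(S)|/\theta) = \sum_{\ell \in A : P(\ell) = 1} E(V_\ell) = n_{P,A}$. ∎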
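The unbiasedness claim can be sanity-checked by simulation. The sketch below is a Monte Carlo experiment, not a proof: it assumes the KMV threshold (the $(k+1)$st smallest hash value), models the random hash function by drawing independent uniform values for identifiers $0, \dots, n-1$, and takes the property to hold for the first `n_p` identifiers. The function names and parameter values are illustrative assumptions.

```python
import random


def subpopulation_estimate(hashes, k, n_p):
    """|P(S)| / theta, where hashes[i] is the hash of identifier i and P(i) is i < n_p."""
    theta = sorted(hashes)[k]   # (k+1)st smallest hash value
    hits = sum(1 for i in range(n_p) if hashes[i] < theta)
    return hits / theta


def mean_estimate(n, k, n_p, trials, seed):
    """Average the estimate over independent draws of the hash function."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        hashes = [rng.random() for _ in range(n)]
        total += subpopulation_estimate(hashes, k, n_p)
    return total / trials


# The averages should be close to the true counts 200 and 50, respectively.
print(mean_estimate(n=200, k=16, n_p=200, trials=4000, seed=1))
print(mean_estimate(n=200, k=16, n_p=50, trials=4000, seed=2))
```

The second call illustrates why unbiasedness of each per-identifier estimate matters: the subpopulation estimate is a sum over only those identifiers that satisfy the property, so any per-identifier bias would accumulate in proportion to the subpopulation size.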
Is Goodness Necessary for Unbiasedness? Here we give an example showing that Goodness cannot be substantially weakened while still guaranteeing unbiasedness of the estimate returned by the ThetaSketch Framework. By construction, the following threshold choosing function causes the estimator of the ThetaSketch Framework to be biased upwards.
(8) 
Therefore, by the contrapositive of Theorem 3.10, it cannot satisfy Condition 3.5. It is an interesting exercise to try to establish this fact directly. It can be done by exhibiting a specific target size , stream , and partial assignment of hash values such that no fixed threshold exists that would satisfy (2) and (3). Here is one such example: , .
The nonexistence of the required fixed threshold is proved by the above plot of . The only value of that would satisfy subcondition (2) is 0.2. However, that value does not satisfy (3), because for .
3.7 Goodness and Monotonicity Imply Variance Bound
As usual, let $U = \cup_{j=1}^m A_j$ be the union of $m$ data streams. Our goal in this section is to identify conditions on a threshold choosing function $T$ which guarantee the following: whenever the ThetaSketch Framework is instantiated with a TCF satisfying the conditions, then for any property $P$, the variance of the estimator of $n_{P,U}$ obtained from the ThetaSketch Framework is bounded above by the variance of the estimator obtained by running the base sampling procedure on the stream $A^* = A_1 \circ \dots \circ A_m$ obtained by concatenating $A_1, \dots, A_m$.
It is easy to see that Goodness alone is not sufficient to ensure such a variance bound. Consider, for example, a TCF $T$ that runs KMV on a stream $A$ unless it determines that $n_A \ge C$, for some fixed value $C$, at which point it sets $\theta$ to 1 (thereby causing the base sampling procedure to sample all elements from $A$). Note that such a base sampling algorithm is not implementable by a sublinear space streaming algorithm, but $T$ nonetheless satisfies Goodness. It is easy to see that such a base sampling algorithm will fail to satisfy our desired comparative variance result when run on constituent streams $A_1, \dots, A_m$ satisfying $n_{A_j} < C$ for all $j$, and $n_U \ge C$. In this case, the variance of the estimate obtained from the combined sketch will be positive, while the variance of the estimator obtained by running the base sampling procedure directly on $A^*$ will be 0.
Thus, for our comparative variance result to hold, we assume that satisfies both Goodness and the following additional monotonicity condition.
Condition 3.11 (Monotonicity Condition).
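Let $A_0, A_1, A_2$ be any three streams, and let $A^* := A_0 \circ A_1 \circ A_2$ denote their concatenation. Fix any hash function $h$ and parameter $k$. Let $\theta = T(k, A_1, h)$, and $\theta^* = T(k, A^*, h)$. Then $\theta^* \le \theta$.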
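As an illustration, the condition can be tested empirically for a KMV-style threshold. The sketch below reads the monotonicity condition as: the threshold chosen for the concatenation of three streams is at most the threshold chosen for the middle stream alone. That reading, the function names, and the convention that the threshold is 1 when a stream has at most $k$ distinct identifiers are assumptions made for this example.

```python
import random


def kmv_threshold(k, stream, h):
    """(k+1)st smallest distinct hash value, or 1.0 if there are at most k."""
    distinct = sorted({h[ident] for ident in stream})
    return distinct[k] if len(distinct) > k else 1.0


def check_monotone(k, trials, seed):
    """Return True if no random triple of streams violates the condition."""
    rng = random.Random(seed)
    h = {ident: rng.random() for ident in range(200)}   # one fixed hash function
    for _ in range(trials):
        a0, a1, a2 = (
            [rng.randrange(200) for _ in range(rng.randrange(120))]
            for _ in range(3)
        )
        theta = kmv_threshold(k, a1, h)
        theta_star = kmv_threshold(k, a0 + a1 + a2, h)
        if theta_star > theta:
            return False
    return True


print(check_monotone(k=8, trials=500, seed=3))  # True
```

For this threshold the condition holds for a simple reason: the concatenation contains every distinct hash value of the middle stream, and adding values to a set can only lower its $(k+1)$st smallest element.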
Theorem 3.12.
Suppose that the ThetaSketch Framework is instantiated with a TCF that satisfies Condition 3.5 (Goodness), as well as Condition 3.11 (monotonicity). Fix a property , and let , …, be input streams. Let denote the union of the distinct labels in the input streams. Let denote the concatenation of the input streams. Let , and let denote the estimate of obtained by evaluating . Let