Non-Empty Bins with Simple Tabulation Hashing
We consider the hashing of a set $X \subseteq U$ with $|X| = m$ using a simple tabulation hash function $h\colon U \to [n] = \{0, \dots, n-1\}$ and analyse the number of non-empty bins, that is, the size of $h(X)$. We show that the expected size of $h(X)$ matches that with fully random hashing to within low-order terms. We also provide concentration bounds. The number of non-empty bins is a fundamental measure in the balls and bins paradigm, and it is critical in applications such as Bloom filters and Filter hashing. For example, normally Bloom filters are proportioned for a desired low false-positive probability assuming fully random hashing (see en.wikipedia.org/wiki/Bloom_filter). Our results imply that if we implement the hashing with simple tabulation, we obtain the same low false-positive probability for any possible input.
We consider the balls and bins paradigm where a set $X \subseteq U$ of $m$ balls is distributed into a set of $n$ bins according to a hash function $h\colon U \to [n]$. We are interested in questions relating to the distribution of $|h(X)|$, for example: What is the expected number of non-empty bins? How well is $|h(X)|$ concentrated around its mean? And what is the probability that a query ball lands in an empty bin? These questions are critical in applications such as Bloom filters and Filter hashing.
In the setting where $h$ is a fully random hash function, meaning that the random variables $(h(x))_{x \in U}$ are mutually independent and uniformly distributed in $[n]$, the situation is well understood. The random distribution process is equivalent to throwing $m$ balls sequentially into $n$ bins, choosing for each ball a bin uniformly at random and independently of the placements of the previous balls. The probability that a given bin stays empty is thus $(1 - 1/n)^m$, so the expected number of non-empty bins is exactly $n\left(1 - (1 - 1/n)^m\right)$ and, unsurprisingly, the number of non-empty bins turns out to be sharply concentrated around this expectation (see for example Kamath et al. for several such concentration results).
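As a quick sanity check of the fully random case, the following sketch compares the closed form $n(1 - (1 - 1/n)^m)$ for $m$ balls and $n$ bins with a small simulation. The function names and the parameter values are illustrative choices only, and Python's pseudo-random generator stands in for true randomness.

```python
import random

def expected_nonempty(m: int, n: int) -> float:
    """Expected number of non-empty bins for m balls in n bins (fully random)."""
    return n * (1.0 - (1.0 - 1.0 / n) ** m)

def simulate_nonempty(m: int, n: int, rng: random.Random) -> int:
    """Throw m balls independently and uniformly into n bins; count the bins hit."""
    return len({rng.randrange(n) for _ in range(m)})

if __name__ == "__main__":
    rng = random.Random(1)
    m, n, trials = 1000, 1024, 200
    avg = sum(simulate_nonempty(m, n, rng) for _ in range(trials)) / trials
    print(f"closed form: {expected_nonempty(m, n):.1f}, simulated: {avg:.1f}")
```

With enough trials the simulated average should settle close to the closed form, which is the benchmark the rest of the paper compares simple tabulation against.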
In practical applications fully random hashing is unrealistic, and so it is desirable to replace the fully random hash functions with realistic and implementable hash functions that still provide at least some of the probabilistic guarantees that were available in the fully random setting. However, as the mutual independence of the keys is often a key ingredient in proving results in the fully random setting, most of these proofs do not carry over. Often the results are simply no longer true, and if they are, one has to come up with alternative techniques for proving them.
In this paper, we study the number of non-empty bins when the hash function is chosen to be a simple tabulation hash function [14, 21], which is very fast and easy to implement (see description below in Section 1.1). We provide estimates on the expected number of non-empty bins which asymptotically match those with fully random hashing on any possible input. (Here we use “asymptotically” in the classic mathematical sense to mean equal to within low order terms, not just within a constant factor.) To get a similar match within the classic $k$-independence paradigm , we would generally need . For comparison, simple tabulation is the fastest known 3-independent hash function . We will also study how the number of non-empty bins is concentrated around its mean.
Our results complement those from , which show that with simple tabulation hashing, we get Chernoff-type concentration on the number of balls in a given bin when . For example, the results from  imply that all bins are non-empty with high probability (whp) when . More precisely, for any constant , there exists a such that if , all bins are non-empty with probability . As a consequence, we only have to study for below. On the other hand,  does not provide any good bounds on the probability that a bin is non-empty when, say, . In this case, our results imply that a bin is non-empty with probability , as in the fully random case. The understanding we provide here is critical to applications such as Bloom filters  and Filter hashing , which we describe in Sections 2.1 and 2.2.
We want to emphasize the advantage of having many complementary results for simple tabulation hashing. An obvious advantage is that simple tabulation can be reused in many contexts, but there may also be applications that need several strong properties to work in tandem. If, for example, an application has to hash a mix of a few heavy balls and many light balls, and the hash function does not know which is which, then the results from  give us the Chernoff-style concentration of the number of light balls in a bin while the results of this paper give us the right probability that a bin contains a heavy ball. For another example where an interplay of properties becomes important, see Section 2.2 on Filter hashing. The reader is referred to  for a survey of results known for simple tabulation hashing, as well as examples where simple tabulation does not suffice and where slower, more sophisticated hash functions are needed.
1.1 Simple tabulation hashing
Recall that a hash function $h$ is a map from a universe $U$ to a range $R$ chosen with respect to some probability distribution on the set of all such functions. If the distribution is uniform (equivalently the random variables $(h(x))_{x \in U}$ are mutually independent and uniformly distributed in $R$) we will say that $h$ is fully random.
Simple tabulation was introduced by Zobrist. For simple tabulation, $U = [u] = \{0, \dots, u-1\}$ and $R = [2^r]$ for some $r \in \mathbb{N}$. The keys $x \in U$ are viewed as vectors $x = (x[0], \dots, x[c-1])$ of $c$ characters with each $x[i] \in \Sigma$. The simple tabulation hash function $h$ is defined by
$$h(x) = \bigoplus_{i \in [c]} h_i(x[i]),$$
where $h_0, \dots, h_{c-1}\colon \Sigma \to R$ are independent fully random hash functions and where $\oplus$ denotes the bitwise XOR. What makes it fast is that the character domains of $h_0, \dots, h_{c-1}$ are so small that they can be stored as tables in fast cache. Experiments in  found that the hashing of -bit keys divided into -bit characters was as fast as two -bit multiplications. Note that on machines with larger cache, it may be faster to use -bit characters. As useful computations normally involve data and hence cache, there is no commercial drive for developing processors that do multiplications much faster than cache look-ups. Therefore, on real-world processors, we always expect cache-based simple tabulation to be at least comparable in speed to multiplication. The converse is not true, since many useful computations do not involve multiplications. Thus there is a drive to make cache faster even if it is too hard/expensive to speed up multiplication circuits.
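The definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: keys are integers split into `c` characters of `char_bits` bits each, the character tables are filled from Python's pseudo-random generator as a stand-in for fully random functions, and the name `make_simple_tabulation` and its parameters are hypothetical.

```python
import random

def make_simple_tabulation(c: int, char_bits: int, r: int, rng: random.Random):
    """Return (h, tables) where h(x) XORs one r-bit table entry per character.

    Keys are (c * char_bits)-bit integers; character i is the i-th block of
    char_bits bits, counted from the least significant end (an assumption).
    """
    sigma = 1 << char_bits
    # One table per character position, filled with independent r-bit values.
    tables = [[rng.getrandbits(r) for _ in range(sigma)] for _ in range(c)]
    mask = sigma - 1

    def h(x: int) -> int:
        out = 0
        for i in range(c):
            out ^= tables[i][(x >> (i * char_bits)) & mask]
        return out

    return h, tables

if __name__ == "__main__":
    h, _ = make_simple_tabulation(c=4, char_bits=8, r=16, rng=random.Random(7))
    print(h(0x01020304))
```

Once the tables are filled they are never modified, so evaluating the hash costs `c` table lookups and `c` XORs.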
Other important properties include that the character table lookups can be done in parallel and that when initialised the character tables are not changed. For applications such as Bloom filters where more than one hash function is needed another nice property of simple tabulation is that the output bits are mutually independent. Using -bit hash values is thus equivalent to using independent simple tabulation hash functions each with values in . This means that we can get independent -bit hash values using only lookups of -bit strings.
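One way to read the last remark is that a single wide hash value can be cut into several narrower ones. The sketch below assumes that reading: tables store entries of `k * r` bits, and the hypothetical helper `split_hash` slices one hash value into `k` values of `r` bits each. All names and parameters are illustrative.

```python
import random

def make_tables(c, char_bits, out_bits, rng):
    """c character tables with independent out_bits-bit entries."""
    return [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
            for _ in range(c)]

def tab_hash(tables, char_bits, x):
    """Simple tabulation: XOR one table entry per character of x."""
    mask = (1 << char_bits) - 1
    out = 0
    for i, table in enumerate(tables):
        out ^= table[(x >> (i * char_bits)) & mask]
    return out

def split_hash(value, k, r):
    """Cut a (k * r)-bit hash value into k separate r-bit hash values."""
    mask = (1 << r) - 1
    return [(value >> (j * r)) & mask for j in range(k)]

if __name__ == "__main__":
    k, r = 3, 10
    tables = make_tables(c=4, char_bits=8, out_bits=k * r, rng=random.Random(3))
    print(split_hash(tab_hash(tables, 8, 0xDEADBEEF), k, r))
```

Because XOR acts on each bit position separately, slice `j` of the output depends only on slice `j` of the table entries, which is why the slices behave like separate simple tabulation functions.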
1.2 Main Results
We will now present our results on the number of non-empty bins with simple tabulation hashing.
The expected number of non-empty bins:
Our first theorem compares the expected number of non-empty bins when using simple tabulation to that in the fully random setting. We denote by the probability that a bin becomes non-empty and by the expected number of non-empty bins when balls are distributed into bins using fully random hashing.
Theorem 1.1.
Let be a fixed set of balls. Let be any bin and suppose that is a simple tabulation hash function. If denotes the probability that then
If we let depend on the hash of a distinguished query ball , e.g., , then the bound on above is replaced by the weaker .
The last statement of the theorem is important in the application to Bloom filters where we wish to upper bound the probability that for a query ball .
To show that the expected relative error is always small, we have to complement Theorem 1.1 with the result from  that all bins are full, whp, when for some large enough constant . In particular, this implies when . The relative error from Theorem 1.1 is maximized when is maximized, and with , it is bounded by . Thus we conclude:
Let be a fixed set of balls and let be a simple tabulation hash function. Then .
Concentration of the number of non-empty bins:
We now consider the concentration of around its mean. In the fully random setting it was shown by Kamath et al.  that the concentration of around is sharp: For any it holds that
which for example yields that whp, that is, with probability ) for any choice of . Unfortunately we cannot hope to obtain such a good concentration using simple tabulation hashing. To see this, consider the set of keys for any constant , e.g. , and let be the event that for . This event occurs with probability . Now if occurs then the keys of all hash to the same value namely . Furthermore, these values are independently and uniformly distributed in for so the distribution of becomes identical to the distribution of non-empty bins when balls are thrown into bins using truly random hashing. This observation ruins the hope of obtaining a sharp concentration around and shows that the lower bound in the theorem below is best possible being the expected number of non-empty bins when balls are distributed into bins.
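The bad event can be made concrete with a small sketch. It assumes two-character keys $(a, b)$ with $a \in \{0, 1\}$, which is one possible instance of the construction; the function name and parameters are hypothetical. Forcing the two entries of the first character table to agree makes the keys $(0, b)$ and $(1, b)$ collide for every $b$, so at most half as many bins can be hit.

```python
import random

def bad_event_demo(half_m: int, r: int, seed: int = 0):
    """Return (number of keys, number of non-empty bins) under the bad event."""
    rng = random.Random(seed)
    n = 1 << r
    # Character tables for two-character keys (a, b), a in {0, 1}, b in range(half_m).
    t0 = [rng.randrange(n) for _ in range(2)]
    t1 = [rng.randrange(n) for _ in range(half_m)]
    t0[1] = t0[0]  # force the event: both values of the first character hash alike
    keys = [(a, b) for a in range(2) for b in range(half_m)]
    images = {t0[a] ^ t1[b] for a, b in keys}
    return len(keys), len(images)

if __name__ == "__main__":
    m, hit = bad_event_demo(half_m=500, r=10)
    print(f"{m} keys hit only {hit} bins")
```

Without the forced collision the same key set would typically hit noticeably more bins, which is the gap the lower bound has to account for.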
Theorem 1.3.
Let be a fixed set of keys. Let be a simple tabulation hash function. Then whp
As argued above, the lower bound in Theorem 1.3 is optimal. Settling with a laxer requirement than high probability, it turns out however that is somewhat concentrated around . This is the content of the following theorem which also provides a high probability upper bound on .
Theorem 1.4.
Let be a fixed set of keys. Let be a random simple tabulation hash function. For it holds that
The term in the second bound in the theorem may be unexpected but it has to be there (at least when ) as we will argue after proving the theorem.
Theorem 1.4 is proved using Azuma’s inequality (which we will state and describe later). It turns out that when one can obtain stronger concentration using a stronger martingale inequality. For intuition, the reader is encouraged to think of the fully random setting where balls are thrown sequentially into bins independently and uniformly at random: In this setting the allocation of a single ball can change the conditionally expected number of non-empty bins by at most and this is the type of observation that normally suggests applying Azuma’s inequality. However, when , it is unlikely that the allocation of a ball will change the conditional expectation of the number of non-empty bins by much: for that to happen the ball has to hit a bin that is already non-empty, and the probability that this occurs is at most . Using a martingale inequality by McDiarmid  that takes the variance of our martingale into consideration, one can obtain the following result, which is an improvement over Theorem 1.4 when , and matches within -notation when .
Theorem 1.5.
Let be a fixed set of keys. Let be a random simple tabulation hash function. Assume . For it holds that
The above bounds are unwieldy so let us disentangle them. First, one can show using simple calculus that when then . If we thus have that . To get a non-trivial bound from (1.3) we have to let and then . This means that (1.3) is trivial when as we can never have more than non-empty bins. For comparison, (1.1) already becomes trivial when .
Suppose now that . For a given put
for some sufficiently large . Then (1.3) gives that . It remains to understand : Assuming that , we have that . For comparison, to get the same guarantee on the probability using (1.1) we would have to put , which is a factor of larger.
Turning to (1.4), it will typically in applications be the term that dominates the bound. For a given we would choose to get .
1.3 Projecting into Arbitrary Ranges
Simple tabulation is an efficient hashing scheme for hashing into -bit hash values. But what do we do if we want hash values in where , say ? Besides being of theoretical interest this is an important question in several practical applications. For example, when designing Bloom filters (which we will describe shortly), to minimize the false positive probability, we have to choose the size of the filters such that . When has to be a power of two, we may be up to a factor of off, and this significantly affects the false positive probability. Another example is cuckoo hashing , which was shown in  to succeed with simple tabulation with probability when . If we have to choose as large as to apply this result, making it much less useful.
The way we remedy this is a standard trick, see e.g. . We choose such that , and hash in the first step to -bit strings with a simple tabulation hash function . Usually suffices and then the entries of the character tables only become twice as long. Defining by our combined hash function is simply defined as . Note that is very easy to compute since we do just one multiplication and since the division by is just an -bit right shift. The only property we will use about is that it is most uniform meaning that for either, or . For example, we could also use defined by , but is much faster to compute. Note that if , then .
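A minimal sketch of this multiply-and-shift step, assuming the projection takes an `r`-bit value `y` to `(y * n) >> r`, which matches the description of one multiplication followed by an `r`-bit right shift. The names `project` and `preimage_sizes` are hypothetical. The second function checks the "most uniform" property by brute force: every bin receives either the floor or the ceiling of $2^r / n$ preimages.

```python
def project(y: int, n: int, r: int) -> int:
    """Map an r-bit value y to a bin in range(n): one multiply, one shift."""
    return (y * n) >> r

def preimage_sizes(n: int, r: int) -> list:
    """Count, for each bin, how many r-bit values are projected onto it."""
    sizes = [0] * n
    for y in range(1 << r):
        sizes[project(y, n, r)] += 1
    return sizes

if __name__ == "__main__":
    n, r = 1000, 20
    sizes = preimage_sizes(n, r)
    print(min(sizes), max(sizes))  # the two values differ by at most one
```

Taking `y % n` would also be most uniform in this sense, but it requires a division instead of a shift.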
A priori it is not obvious that has the same good properties as “normal” simple tabulation. The set of bins can now be viewed as , so each bin consists of many “sub-bins”, and a result on the number of non-empty sub-bins does not translate directly to any useful result on the number of non-empty bins. Nonetheless, many proofs of results for simple tabulation do not need to be modified much in this new setting. For example, the simplified proof given by Aamand et al.  of the result on cuckoo hashing from  can be checked to carry over to the case where the hash functions are implemented as described above if is sufficiently large. We provide no details here.
For the present paper the relevant analogue to Theorem 1.1 is the following:
Theorem 1.6.
Let be a fixed set of balls, and let with . Suppose is a simple tabulation hash function. Define . If denotes the probability that , then
If we let (and hence ) depend on the hash of a distinguished query ball , then the bound on above is replaced by the weaker .
If we assume , say, and let be a bin of we obtain the following estimate on :
This is very close to what is obtained from Theorem 1.1 and to make the difference smaller we can increase further.
There are also analogues of Theorems 1.3, 1.4 and 1.5 in which the bins are partitioned into groups of almost equal size and where the interest is in the number of groups that are hit by a ball. To avoid making this paper unnecessarily technical, we refrain from stating and proving these theorems, but in Section 5 we will show how to modify the proof of Theorem 1.1 to obtain Theorem 1.6.
One natural alternative to simple tabulation is to use -independent hashing . Mitzenmacher and Vadhan  actually estimate the probability of getting a false positive when using -independent hashing for Bloom filters, but this error probability is strongly related to the expected number of non-empty bins (in the fully random setting it is ), so only a slight modification of their proof is needed. Using this easy variation of their inclusion-exclusion based argument one can show that if is odd and if the probability that a given bin is non-empty satisfies
and this is optimal, at least when is not too large, say — there exist two (different) -independent families making respectively the upper and the lower bound tight for a certain set of keys. A similar result holds when is even. Although approaches when increases, for and , we have a deviation by an additive constant term. In contrast, the probability that a bin is non-empty when using simple tabulation is asymptotically the same as in the fully random setting.
Another alternative when studying the number of non-empty bins is to assume that the input comes with a certain amount of randomness. This was studied in  too and a slight variation of their argument shows that if the input has enough entropy the probability that a bin is empty is asymptotically the same as in the fully random setting even if we only use -independent hashing. This is essentially what we get with simple tabulation. However, our results have the advantage of holding for any input with no assumptions on its entropy. Now (1.5) also suggests the third alternative of looking for highly independent hash functions. For the expectation (1.5) shows that if we would need to get guarantees comparable to those obtained for simple tabulation. Such highly independent hash functions were first studied by Siegel , the most efficient known construction today being the double tabulation by Thorup  which gives independence using space and time . While this space and time matches that of simple tabulation within constant factors, it is slower by at least an order of magnitude. As mentioned in , double tabulation with 32-bit keys divided into 16-bit characters requires 11 times as many character table lookups as with simple tabulation and we lose the same factor in space. The larger space of double tabulation means that tables may expand into much slower memory, possibly costing us another order of magnitude in speed.
There are several other types of hash functions that one could consider, e.g., those from [6, 12], but simple tabulation is unique in its speed (like two multiplications in the experiments from ) and ease of implementation, making it a great choice in practice. For a more thorough comparison of simple tabulation with other hashing schemes, the reader is referred to .
Before proving our main results we describe two almost immediate applications.
2.1 Bloom Filters
Bloom filters were introduced by Bloom . We will only discuss them briefly here and argue which guarantees are provided when implementing them using simple tabulation. For a thorough introduction including many applications see the survey by Broder and Mitzenmacher . A Bloom filter is a simple data structure which space efficiently represents a set and supports membership queries of the form “is in ”. It uses independent hash functions and arrays each of bits which are initially all . For each we calculate and set the ’th bit of to noting that a bit may be set to several times. To answer the query “is in ” we check if the bits corresponding to are all , outputting “yes” if so and “no” otherwise. If we will certainly output the correct answer but if we potentially get a false positive in the case that all the bits corresponding to are set to by other keys in . In the case that the probability of getting a false positive is
which with fully random hashing is .
It should be noted that Bloom filters are most commonly described in a related though not identical way. In this related setting we use a single -bit array and let , setting the bits of corresponding to to for each . With fully random hashing the probability that a bit is set to is then and the probability of a false positive is thus at most . Despite the difference, simple calculus shows that and so
In particular if or if the number of filters is not too large (both being the case in practice) the failure probabilities in the two models are almost identical. We use the model with different tables each of size as this makes it very easy to estimate the error probability using Theorem 1.1 and the independence of the hash functions. We can in fact view as a map from to but having image in getting us to the model with just one array.
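The model with several separate arrays can be sketched as follows. This is an illustrative implementation under stated assumptions, not a tuned one: each of the `k` arrays has `2**r` bits (stored one per byte for readability), each array has its own simple tabulation hash function over `c` characters of `char_bits` bits, and the class name and parameters are hypothetical.

```python
import random

class TabulationBloomFilter:
    """Bloom filter with k arrays of 2**r bits, one simple tabulation hash each."""

    def __init__(self, k, r, c=4, char_bits=8, seed=0):
        rng = random.Random(seed)
        self.k, self.c, self.char_bits = k, c, char_bits
        # tables[j][i][a]: r-bit entry for hash function j, position i, character a.
        self.tables = [[[rng.getrandbits(r) for _ in range(1 << char_bits)]
                        for _ in range(c)] for _ in range(k)]
        self.arrays = [bytearray(1 << r) for _ in range(k)]

    def _hash(self, j, x):
        mask = (1 << self.char_bits) - 1
        out = 0
        for i in range(self.c):
            out ^= self.tables[j][i][(x >> (i * self.char_bits)) & mask]
        return out

    def add(self, x):
        for j in range(self.k):
            self.arrays[j][self._hash(j, x)] = 1

    def __contains__(self, x):
        # "yes" only if the bit for x is set in every one of the k arrays
        return all(self.arrays[j][self._hash(j, x)] for j in range(self.k))

if __name__ == "__main__":
    bf = TabulationBloomFilter(k=3, r=12)
    for key in range(1000):
        bf.add(key)
    print(all(key in bf for key in range(1000)))  # stored keys are always found
```

A query for a key outside the stored set answers "yes" exactly when its bit is set in all `k` arrays, which is the false positive event analysed in this section.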
From Theorem 1.1 we immediately obtain the following corollary.
Let with and . Suppose we represent with a Bloom filter using independent simple tabulation hash functions . The probability of getting a false positive when querying is at most
At this point one can play with the parameters. In the fully random setting one can show that if the number of balls and the total number of bins are fixed one needs to choose and such that in order to minimise the error probability (see ). For this, one needs and if is chosen so, the probability above is at most . In applications, is normally a small number like for a 0.1% false positive probability. In particular, , and then , asymptotically matching the fully random setting.
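For the fully random baseline, the standard parameter choice can be checked numerically. The sketch assumes the well-known rule of thumb that each array should be sized so that a given bit is set with probability about one half, that is, roughly $m / \ln 2$ bits per array for $m$ keys, giving a false positive probability of about $2^{-k}$ with $k$ arrays. The function names are hypothetical.

```python
import math

def per_array_size(m: int) -> int:
    """Bits per array so that a given bit ends up set with probability about 1/2."""
    return math.ceil(m / math.log(2))

def false_positive_fully_random(m: int, n: int, k: int) -> float:
    """Probability that k independent arrays of n bits all show a set bit."""
    p0 = 1.0 - (1.0 - 1.0 / n) ** m  # probability that one given bit is set
    return p0 ** k

if __name__ == "__main__":
    m, k = 10_000, 10
    n = per_array_size(m)
    print(n, false_positive_fully_random(m, n, k), 2.0 ** -k)
```

With `k = 10` this comes out just under 0.1%, in line with the figure quoted above.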
To resolve the issue that the range of a simple tabulation function has size but that we wish to choose , we choose such that and use the combined hash function described in Section 1.3. Now appealing to Theorem 1.6 instead of Theorem 1.1 we can again drive the false positive probability down to when .
The argument by Mitzenmacher and Vadhan  discussed in relation to (1.5) actually yields a tight bound on the probability of a false positive when using -independent hashing for Bloom filters. We do not state their result here but mention that when is constant the error probability may again deviate by an additive constant from that of the fully random setting. It is also shown in  that if the input has enough entropy we can get the probability of a false positive to match that from the fully random setting asymptotically even using -independent hashing, yet it cannot be trusted for certain types of input.
Now, imagine you are a software engineer who wants to implement a Bloom filter, proportioning it for a desired low false-positive probability. You can go to a Wikipedia page (en.wikipedia.org/wiki/Bloom_filter) or a textbook like  and read how to do it assuming full randomness. If you read , what do you do? Do you set and cross your fingers, or do you pay the cost of a slower hash function with a larger , adjusting the false-positive probabilities accordingly? Which do you pick?
With our result, there are no hard choices. The answer is simple. We just have to add that everything works as stated for any possible input if the hashing is implemented with simple tabulation hashing (en.wikipedia.org/wiki/Tabulation_hashing), which is both very fast and very easy to implement.
2.2 Filter Hashing
In Filter hashing, as introduced by Fotakis et al. , we wish to store as many elements as possible of a set of size in hash tables . The total number of entries in the tables is at most and each entry can store just a single key. For we pick independent hash functions where is the number of entries in . The keys are allocated as follows: We first greedily store a key from in for each . This lets us store exactly keys. Letting be the so stored keys and the remaining keys, we repeat the process, storing keys in using etc.
An alternative and in practice more relevant way to see this is to imagine that the keys arrive sequentially. When a new key arrives we let be the smallest index such that is unmatched and store in that entry. If no such exists the key is not stored. The name Filter hashing comes from this view which prompts the picture of particles (the keys) passing through filters (the tables) being caught by a filter only if there is a vacant spot.
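The sequential view translates directly into code. In this sketch the tables are Python lists with `None` marking a vacant entry, and `make_stand_in_hash` is a hypothetical placeholder hash; an implementation following the paper would use one simple tabulation hash function per table instead.

```python
import random

def filter_hashing_insert(tables, hashes, key):
    """Store key in the first table whose entry hashes[i](key) is vacant.

    Returns the index of the table used, or None if the key is not stored.
    """
    for i, (table, h) in enumerate(zip(tables, hashes)):
        slot = h(key)
        if table[slot] is None:
            table[slot] = key
            return i
    return None

def make_stand_in_hash(size, seed):
    """Hypothetical stand-in hash into range(size); not simple tabulation."""
    salt = random.Random(seed).getrandbits(64)
    return lambda key: hash((salt, key)) % size

if __name__ == "__main__":
    sizes = [512, 256, 128, 64]
    tables = [[None] * s for s in sizes]
    hashes = [make_stand_in_hash(s, seed) for seed, s in enumerate(sizes)]
    overflow = sum(filter_hashing_insert(tables, hashes, x) is None
                   for x in range(900))
    print("keys not stored:", overflow)
```

Keys that fall through every filter are the overflow that has to be handled by a separate scheme.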
The question is, for a given , how few filters are needed in order to store all but at most keys with high probability. Note that the remaining keys can be stored using any hashing scheme which uses linear space, for example Cuckoo hashing with simple tabulation [13, 14], to get a total space usage of .
One can argue that with fully random hashing one needs filters to achieve that whp at least keys are stored. To see that we can achieve this bound with simple tabulation we essentially proceed as in . Let be any constant and choose according to Theorem 1.3 so that if with and is a simple tabulation hash function, then with probability at least .
Let . For , we pick to be the largest power of two below . We then set , terminating when . Then is indexed by -bit strings — the range of a simple tabulation hash function . Letting be minimal such that we have that and as decreases by at least a factor of in each step, .
How many bins of get filled? Even if all bins from filters are non-empty we have at least balls left and so with probability the number of bins we hit is at least
Thus, with probability at least , for each , filter gets at least balls. Since , the number of overflowing balls is at most in this case. Assuming for example that , as would be the case in most applications, we get that the fraction of balls not stored is with probability at least .
The hashing scheme for Filter hashing described in  uses -independent polynomial hashing to achieve an overflow of at most balls. In particular the choice of hash functions depends on and becomes more unrealistic the smaller is. In contrast when using simple tabulation (which is only -independent) for Filter hashing we only need to change the number of filters, not the hashing, when varies. It should be mentioned that only filters are needed for the result in  whereas we need a constant factor more. It can however be shown (we provide no details) that we can get down to filters by applying (1.2) of Theorem 1.4 if we settle for an error probability of for a given constant .
Taking a step back we see the merits of a hashing scheme giving many complementary probabilistic guarantees. As shown by Pǎtraşcu and Thorup , Cuckoo hashing  implemented with simple tabulation succeeds with probability (for a recent simpler proof of this result, see Aamand et al. ). More precisely, for a set of balls, let be the least power of two bigger than . Allocating tables of size , and using simple tabulation hash functions , with probability Cuckoo hashing succeeds in placing the keys such that every key is found in either or . In case it fails, we just try again with new random .
We now use Cuckoo hashing to store the keys remaining after the Filter hashing, appending the Cuckoo tables to the filter tables so that and for . Then if and only if for some , we have . We note that all these lookups could be done in parallel. Moreover, as the output bits of simple tabulation are mutually independent, the hash functions , , can be implemented as a single simple tabulation hash function and therefore all be calculated using just look-ups in simple tabulation character tables.
As in  we define a position character to be an element . Simple tabulation hash functions are initially defined only on keys in but we can extend the definition to sets of position characters by letting . This coincides with when the key is viewed as the set of position characters .
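The extension to sets of position characters can be sketched as below, assuming a position character is a pair `(i, a)` of a position `i` and a character value `a`, and that the hash of a set is the XOR of the corresponding table entries. The function names are hypothetical. One consequence worth seeing in code is that the hash of a symmetric difference is the XOR of the two hashes, since shared position characters cancel.

```python
import random
from functools import reduce

def hash_position_chars(tables, chars):
    """XOR of the table entries over a set of position characters (i, a)."""
    return reduce(lambda acc, pc: acc ^ tables[pc[0]][pc[1]], chars, 0)

def key_to_position_chars(x, c, char_bits):
    """View an integer key as the set {(i, x[i]) : i in range(c)}."""
    mask = (1 << char_bits) - 1
    return {(i, (x >> (i * char_bits)) & mask) for i in range(c)}

if __name__ == "__main__":
    rng = random.Random(5)
    tables = [[rng.getrandbits(16) for _ in range(16)] for _ in range(3)]
    a = key_to_position_chars(0x3A7, c=3, char_bits=4)
    b = key_to_position_chars(0x3A9, c=3, char_bits=4)
    lhs = hash_position_chars(tables, a ^ b)  # a ^ b is the symmetric difference
    rhs = hash_position_chars(tables, a) ^ hash_position_chars(tables, b)
    print(lhs == rhs)
```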
We start by describing an ordering of the position characters, introduced by Pǎtraşcu and Thorup  in order to prove that the number of balls hashing to a specific bin is Chernoff concentrated when using simple tabulation. If is a set of keys and is any ordering of the position characters we for define . Here we view the keys as sets of position characters. Further define to be the set of keys in containing as a position character. Pǎtraşcu and Thorup argued that the ordering may be chosen such that the groups are not too large.
Lemma 3.1 (Pǎtraşcu and Thorup ).
Let with . There exists an ordering of the position characters such that for all position characters . If is any (query) key in or outside , we may choose the ordering such that the position characters of are first in the order and such that for all position characters .
Let us throughout this section assume that is chosen so as to satisfy the properties of Lemma 3.1. A set is said to be -bounded if for all . In other words no bin gets more than balls from .
Lemma 3.2 (Pǎtraşcu and Thorup ).
Assume that the number of bins is at least . For any constant , and all groups are -bounded with probability at least .
Lemma 3.3 (Pǎtraşcu and Thorup ).
Let be a fixed constant and assume that . For any constant no bin gets more than balls with probability at least .
Let us describe heuristically why we are interested in the order and its properties. We will think of as being uncovered stepwise by fixing only when has been fixed. At the point where is to be fixed the internal clustering of the keys in has been settled and acts merely as a translation, that is, as a shift by an XOR with . This viewpoint opens the way for sequential analyses where for example it may be possible to calculate the probability of a bin becoming empty or to apply martingale concentration inequalities. The hurdle is that the internal clusterings of the keys in the groups are not independent, as the hash values of earlier position characters dictate how later groups cluster, so we still have to come up with ways of dealing with these dependencies.
4 Proofs of main results
In order to pave the way for the proofs of our main results we start by stating two technical lemmas, namely Lemmas 4.1 and 4.2 below. We provide proofs at the end of this section. Lemma 4.1 is hardly more than an observation. We include it as we will be using it repeatedly in the proofs of our main theorems.
Lemma 4.1.
Assume and are real numbers. Further assume that and . Then
If further for some real then
For the second lemma we assume that the set of keys has been partitioned into groups . Let denote the number of sets such that but , that is, the number of pairs of colliding keys internal to . Denote by the total number of collisions internal to the groups. The second lemma bounds the expected value of as well as its variance in the case where the groups are not too large.
Lemma 4.2.
Let with be partitioned as above. Suppose that there is an such that for all , . Then
For a given query ball and a bin , the upper bound on is also an upper bound on . For the variance estimate note that if in particular , then .
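The quantity bounded in the second lemma is easy to compute for a concrete partition. The sketch below counts, for each group, the unordered pairs of distinct keys that receive the same hash value, and sums over groups. The function name is hypothetical, and the groups are assumed to be lists of distinct keys.

```python
from collections import Counter
from math import comb

def internal_collisions(groups, h):
    """Total number of unordered pairs {x, y} in a common group with h(x) == h(y)."""
    total = 0
    for group in groups:
        bin_loads = Counter(h(x) for x in group)
        # a bin holding `load` keys of this group contributes load-choose-2 pairs
        total += sum(comb(load, 2) for load in bin_loads.values())
    return total

if __name__ == "__main__":
    groups = [[1, 2, 3], [4, 5]]
    print(internal_collisions(groups, lambda x: x % 2))
```

Pairs of keys in different groups are deliberately not counted, even if they land in the same bin.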
Proof of Theorem 1.1.
Let us first prove the theorem in the case where is a fixed bin not chosen dependently on the hash value of a query ball. If the result is trivial as then the stated upper bound is at least . Assume then that . Consider the ordering of the position characters obtained from Lemma 3.1 such that all groups have size at most . We will denote by the maximal possible group size.
We randomly fix the in the order obtained from not fixing before having fixed for all . If then and since for all only has to be fixed in order to settle . The number of different bins hit by the keys of when fixing is thus exactly the size of the set which is simply translated by an XOR with and for we have that is uniform in its range when conditioned on the values .
To make it easier to calculate the probability that we introduce some dummy balls. At the point where we are to fix we dependently on in any deterministic way choose a set of dummy balls, disjoint from , such that has size exactly . We will say that a bin is hit if either or there exists an such that for some . In the latter case we will say that is hit by a dummy ball. This modified random process can be seen as ensuring that when we are to finally fix the hash values of the elements of by the last translation with , we modify the group by adding dummy balls to ensure that exactly bins are hit by either a ball in or a dummy ball in . We let denote the total number of dummy balls.
Let denote the event that is hit and denote the event that is hit by a dummy ball. With the presence of the dummy balls, is easy to calculate:
Clearly so for a lower bound on it suffices to upper bound . Let denote the event that is hit by a dummy ball from . We can calculate . The conditional probability is exactly as the choice of only depends on the hash values and when translated by an XOR with the bin is hit with probability . It follows that and thus that . Finally the total number of dummy balls is upper bounded by the number of internal collisions in the groups, so Lemma 4.2 gives that . This gives the desired lower bound on (throwing away the factor of , in order to simplify the statement in the theorem).
For the upper bound note that so by Lemma 4.1
Using the inequality holding for and with and (note that as we assumed that ) we obtain that
as desired. The bound on follows immediately as .
Finally consider the case where is chosen conditioned on for a query ball and a bin . Here we may assume that as otherwise the claimed upper bound is at least . We choose the ordering such that the position characters of are first in the order and such that all groups have size at most which is possible by Lemma 3.1. Let denote the maximal possible group size. Introducing dummy balls the same way as before and repeating the arguments, the probability of the event that is hit satisfies
The desired upper bound follows immediately as .
For the lower bound we again let denote the event that is hit by a dummy ball and denote the event that is hit by a dummy ball from . Then