The Power of Two Choices with Simple Tabulation

Søren Dahlgaard, Mathias Bæk Tejs Knudsen, Eva Rotenberg, and Mikkel Thorup
University of Copenhagen
{soerend,knudsen,roden,mthorup}@di.ku.dk
Research partly supported by Mikkel Thorup’s Advanced Grant from the Danish Council for Independent Research under the Sapere Aude research career programme. Research partly supported by the FNU project AlgoDisc - Discrete Mathematics, Algorithms, and Data Structures.
Abstract

The power of two choices is a classic paradigm for load balancing when assigning $m$ balls to $n$ bins. When placing a ball, we pick two bins according to two hash functions $h_0$ and $h_1$, and place the ball in the least loaded bin. Assuming fully random hash functions, when $m = O(n)$, Azar et al. [STOC’94] proved that the maximum load is $\lg\lg n + O(1)$ with high probability.

In this paper, we investigate the power of two choices when the hash functions $h_0$ and $h_1$ are implemented with simple tabulation, which is a very efficient hash function evaluated in constant time. Following their analysis of Cuckoo hashing [J.ACM’12], Pǎtraşcu and Thorup claimed that the expected maximum load with simple tabulation is $O(\lg\lg n)$. This did not include any high probability guarantee, so the load balancing was not yet to be trusted.

Here, we show that with simple tabulation, the maximum load is $O(\lg\lg n)$ with high probability, giving the first constant time hash function with this guarantee. We also give a concrete example where, unlike with fully random hashing, the maximum load is not bounded by $\lg\lg n + O(1)$, or even $(1+o(1))\lg\lg n$, with high probability. Finally, we show that the expected maximum load is $\lg\lg n + O(1)$, just like with fully random hashing.

1 Introduction

Consider the problem of placing $n$ balls into $n$ bins. If the balls are placed independently and uniformly at random, it is well known that the maximum load of any bin is $\Theta(\log n / \log\log n)$ with high probability (whp) [7]. Here, by high probability, we mean probability $1 - O(n^{-\gamma})$ for arbitrarily large $\gamma = O(1)$. An alternative variant chooses $d$ possible bins per ball independently and uniformly at random, and places the ball in the bin with the lowest load (breaking ties arbitrarily). It was shown by Azar et al. [1] that with this scheme the maximum load, surprisingly, drops to $\log\log n / \log d + O(1)$ whp. When $d = 2$, this is known as the power of two choices paradigm. Applications are surveyed in [9, 10].

Here, we are interested in applications where the two bins are picked via hash functions $h_0$ and $h_1$, so that the two choices for a ball, or key, can be recomputed. The most obvious such application is a hash table with chaining. In the classic hash table by chaining (see e.g. [8]), keys are inserted into a table using a hash function to decide a bin. Collisions are handled by making a linked list of all keys in the bin. If we insert $n$ keys into a table of size $n$ and the hash function used is perfectly random, then the longest chain has length $\Theta(\log n / \log\log n)$ whp. If we instead use the two-choice paradigm, placing each key in the shortest of the two chains selected by the hash functions, then the maximum time to search for a key in the two selected chains is $\Theta(\lg\lg n)$ whp, but this assumes that the hash functions $h_0$ and $h_1$ are truly random, which is not realistic. No constant time implementation of these hash functions (using space less than the size of the universe) was known to yield maximum load $O(\lg\lg n)$ with high probability (see the paragraph Alternatives below).

In this paper we consider the two-choice paradigm using simple tabulation hashing dating back to Zobrist [21]. With keys from a universe $[u] = \{0, \ldots, u-1\}$, we view a key as partitioned into $c$ characters from the alphabet $\Sigma = [u^{1/c}]$. For each character position $i \in [c]$, we have an independent character table $T_i$ assigning random $r$-bit hash values to the characters. The $r$-bit hash of a key $x = (x_0, \ldots, x_{c-1})$ is computed as $\bigoplus_{i \in [c]} T_i[x_i]$, where $\oplus$ denotes bit-wise XOR. This takes constant time. In [15], with 8-bit characters, this was found to be as efficient as two multiplications over the same domain, e.g., 32 or 64 bit keys. Pǎtraşcu and Thorup [15] have shown that simple tabulation, which is not even $4$-independent in the classic notion of Carter and Wegman [2], has many desirable algorithmic properties. In particular, they show that the error probability with cuckoo hashing [14] is $O(n^{-1/3})$, and they claim (no details provided) that their analysis can be extended to give an $O(\lg\lg n)$ bound on the expected maximum load in the two-choice paradigm. For the expected bound, they can assume that cuckoo hashing doesn’t fail. However, [15] present concrete inputs for which cuckoo hashing fails with probability $\Omega(n^{-1/3})$, so this approach does not work for high probability bounds.

Results

In this paper, we show that simple tabulation works almost as well as fully random hashing when used in the two-choice paradigm. Similar to [15], we consider the bipartite case where $h_0$ and $h_1$ hash to different tables. Our main result is that simple tabulation gives $O(\lg\lg n)$ maximum load with high probability. This is the first result giving this guarantee with constant evaluation time for any practical hash function.

Theorem 1.

Let $h_0$ and $h_1$ be two independent random simple tabulation hash functions. If $m = O(n)$ balls are put into two tables of $n$ bins sequentially using the two-choice paradigm with $h_0$ and $h_1$, then for any constant $\gamma > 0$, the maximum load of any bin is $O(\lg\lg n)$ with probability $1 - O(n^{-\gamma})$.

We also prove the following result regarding the expected maximum load, improving on the expected bound of Pǎtraşcu and Thorup [15].

Theorem 2.

Let $h_0$ and $h_1$ be two independent random simple tabulation hash functions. If $m = O(n)$ balls are put into two tables of $n$ bins sequentially using the two-choice paradigm with $h_0$ and $h_1$, then the expected maximum load is at most $\lg\lg n + O(1)$.

In contrast to the positive result of Theorem 1, we also show that for any $k > 0$ there exists a set of $n$ keys such that the maximum load is at least $k \lg\lg n$ with probability $\Omega(n^{-\gamma})$ for some $\gamma > 0$. This shows that Theorem 1 is asymptotically tight and that, unlike the fully random case, $\lg\lg n + O(1)$ is not the right high probability bound for the maximum load.

Alternatives

It is easy to see that -independence suffices for the classic analysis of the two-choice paradigm [1] placing balls in bins. Several methods exist for computing such highly independent hash functions in constant time using space similar to ours [4, 17, 18], but these methods rely on a probabilistic construction of certain unbalanced constant degree expanders, and this only works with parameterized high probability in . By this we mean that to get probability , we have to parameterize the construction of the hash function by , which results in an evaluation time. In contrast the result of Theorem 1 works for any constant without changing the hash function. Moreover, for -independence, the most efficient method is the double tabulation from [18], and even if we are lucky that it works, it is still an order of magnitude slower than simple tabulation.

One could also imagine using the uniform hashing schemes from [5, 13]. Again, we have the problem that the constructions only work with parameterized high probability. The constructions from [5, 13] are also worse than simple tabulation in that they are more complicated and use space.

In a different direction, Woelfel [20] showed how to guarantee maximum load of $\lg\lg n + O(1)$ using the hash family from [6]. This is better than our result by a constant factor and matches that of truly random hash functions. However, the result of [20] only works with parameterized high probability. The hash family used in [20] is also slower and more complicated to implement than simple tabulation. More specifically, for the most efficient implementation of the scheme in [20], Woelfel suggests using the tabulation hashing from [19] as a subroutine. However, the tabulation hashing of [19] is strictly more complicated than simple tabulation, which we show here to work directly with non-parameterized high probability.

Finally, it was recently shown by Reingold et al. [16] how to guarantee a maximum load of whp. using the hash functions of [3]. These functions use a seed of random bits and can be evaluated in time. With simple tabulation, we are not so concerned with the number of random bits, but note that the character tables could be filled with an -independent pseudo-random number generator. The main advantage of simple tabulation is that we have constant evaluation time. As an example using the result of [16] on the previously described example of a hash table with chaining would give lookup time instead of .

Techniques

In order to show Theorem 1, we first show a structural lemma, showing that if a bin has high load, the underlying hash graph must either have a large component or large arboricity. This result is purely combinatorial and holds for any choice of the hash functions $h_0$ and $h_1$. We believe that this result is of independent interest, and could be useful in proving similar results for other hash functions. In order to show Theorem 2, we use Theorem 1 combined with the standard idea of bounding the probability of a large binomial tree occurring in the hash graph. An important trick in our analysis for the expected case is to only consider a pruned binomial tree where all degrees are large.

It remains a major open problem what happens for $m \gg n$ balls, and it does not seem like current techniques alone generalize to this case without the assumption that the hash functions are fully random. We do not know of any practical hash functions that guarantee that the difference between the maximum and the average load is $\lg\lg n + O(1)$ with high probability when $m \gg n$; not even $O(\log n)$-independence appears to suffice for this case.

Structure of the paper

In Section 2 we introduce well-known results and notation used throughout the paper. In Section 3, we first show that we cannot hope to get maximum load $\lg\lg n + O(1)$ whp for simple tabulation. We then show a structural lemma regarding the arboricity of the hash graph and maximum load of any bin. Finally, we use this lemma to prove Theorem 1. In Section 4 we prove Theorem 2. The proofs of Sections 3 and 4 rely heavily on a few structural lemmas regarding the dependencies of keys with simple tabulation. The proofs of these lemmas are included in Section 5.

2 Preliminaries

2.1 Simple Tabulation

Let us briefly review simple tabulation hashing. The goal is to hash keys from some universe $[u] = \{0, \ldots, u-1\}$ into some range $\mathcal{R} = [2^r]$ (i.e. hash values are $r$-bit numbers for convenience). In tabulation hashing we view a key $x \in [u]$ as a vector of $c > 1$ characters from the alphabet $\Sigma = [u^{1/c}]$, i.e. $x = (x_0, \ldots, x_{c-1}) \in \Sigma^c$. We generally assume $c$ to be a small constant.

In simple tabulation hashing we initialize $c$ independent random tables $T_0, \ldots, T_{c-1}$, each mapping characters to $r$-bit hash values. The hash value $h(x)$ is then computed as

$$h(x) = \bigoplus_{i \in [c]} T_i[x_i] \tag{1}$$

where $\oplus$ denotes the bitwise XOR operation. This is a well known scheme dating back to Zobrist [21]. Simple tabulation is known to be just $3$-independent, but it was shown in [15] to have much more powerful properties than this suggests. This includes fourth moment bounds, Chernoff bounds when distributing balls into many bins, and random graph properties necessary for cuckoo hashing.
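As an illustration of the scheme just described, the following is a minimal Python sketch, not the authors' implementation. The class name `SimpleTabulation`, its parameters (`c`, `char_bits`, `r`, `seed`), and the choice to read characters from the least significant end of an integer key are all assumptions made for this example.

```python
import random


class SimpleTabulation:
    """Sketch of simple tabulation: XOR of one table lookup per character."""

    def __init__(self, c=4, char_bits=8, r=32, seed=None):
        # All parameter names are illustrative assumptions.
        rng = random.Random(seed)
        self.c = c
        self.char_bits = char_bits
        self.mask = (1 << char_bits) - 1
        # c independent character tables, each assigning a random r-bit
        # value to every character of the alphabet.
        self.tables = [
            [rng.getrandbits(r) for _ in range(1 << char_bits)]
            for _ in range(c)
        ]

    def __call__(self, key):
        h = 0
        for i in range(self.c):
            # Character i is taken as the i-th block of char_bits bits.
            char = (key >> (i * self.char_bits)) & self.mask
            h ^= self.tables[i][char]
        return h


h = SimpleTabulation(c=4, char_bits=8, r=32, seed=42)
print(h(0xDEADBEEF))
```

With 8-bit characters and 32-bit keys this uses four tables of 256 entries each, so an evaluation is four lookups and three XORs.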

Notation

We will now recall some of the notation used in [15]. Let $S \subseteq [u]$ be a set of keys. Denote by $\pi(S, i)$ the projection of $S$ on the $i$th character, i.e. $\pi(S, i) = \{x_i \mid x \in S\}$. We also use this notation for keys, so $\pi((x_0, \ldots, x_{c-1}), i) = x_i$. A position character is an element of $[c] \times \Sigma$. Under this definition a key $x \in [u]$ can be viewed as a set of $c$ position characters $\{(0, x_0), \ldots, (c-1, x_{c-1})\}$. Furthermore we assume that $h$ is defined on position characters as $h((i, \alpha)) = T_i[\alpha]$. This definition extends to sets of position characters in a natural way by taking the XOR over the hash of each position character.

Dependent keys

That simple tabulation is not $4$-independent implies that there exist keys $x_1, \ldots, x_4$ such that for any choice of $h$, $h(x_1)$ is dependent on $h(x_2)$, $h(x_3)$, $h(x_4)$. However, contrary to e.g. polynomial hashing this is not the case for all $4$-tuples. Such key dependences in simple tabulation can be completely classified. We will state this as the following lemma, first observed in [19].

Lemma 1 (Thorup and Zhang).

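Let $x_1, \ldots, x_k$ be keys in $[u]$. If $x_1, \ldots, x_k$ are dependent, then there exists an $I \subseteq \{1, \ldots, k\}$ such that each position character of $(x_i)_{i \in I}$ appears an even number of times.

Conversely, if each position character of $x_1, \ldots, x_k$ appears an even number of times, then $x_1, \ldots, x_k$ are dependent and, for any $h$,

$$h(x_1) \oplus \cdots \oplus h(x_k) = 0.$$

This means that if a set of keys has symmetric difference $\emptyset$, then it is dependent. Throughout the paper, we will denote the symmetric difference between the position characters of $\{x_1, \ldots, x_k\}$ as $\bigoplus_{i=1}^{k} x_i$.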
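A small example may help make the dependence concrete. The sketch below represents keys as tuples of characters (an assumption of this example) and uses hypothetical helpers `tabulation_hash` and `symmetric_difference`. For the four two-character keys $(a,c), (a,d), (b,c), (b,d)$, every position character appears exactly twice, so the hash values XOR to zero regardless of how the tables are filled.

```python
import random


def tabulation_hash(tables, key):
    """XOR of one table entry per character; key is a tuple of characters."""
    h = 0
    for table, ch in zip(tables, key):
        h ^= table[ch]
    return h


def symmetric_difference(keys):
    """Position characters (i, x_i) that occur an odd number of times."""
    odd = set()
    for key in keys:
        for pos_char in enumerate(key):
            odd ^= {pos_char}  # toggle membership
    return odd


rng = random.Random(0)
tables = [[rng.getrandbits(32) for _ in range(256)] for _ in range(2)]

# Four keys in which every position character appears exactly twice.
keys = [(1, 10), (1, 20), (2, 10), (2, 20)]

xor_all = 0
for k in keys:
    xor_all ^= tabulation_hash(tables, k)

print(symmetric_difference(keys))  # empty set: the keys are dependent
print(xor_all)                     # 0 for any choice of tables
```

Any three of these keys have a non-empty symmetric difference, which is consistent with simple tabulation being 3-independent.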

In [5], the following lemma was shown:

Lemma 2 ([5]).

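Let $X \subseteq [u]$ be a subset with $n$ elements. The number of $2t$-tuples $(x_1, \ldots, x_{2t}) \in X^{2t}$ such that

$$x_1 \oplus \cdots \oplus x_{2t} = \emptyset$$

is at most $((2t-1)!!)^c n^t$. (Where $a!! = a(a-2)(a-4)\cdots$)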

In order to prove the main results of this paper, we prove several similar lemmas pertaining to the number of tuples with a certain number of dependent keys.

2.2 Two Choices

In the two-choice paradigm, we are distributing $m$ balls (keys) into $n$ bins. The keys arrive sequentially, and we associate with each key $x$ two random bins $h_0(x)$ and $h_1(x)$ according to hash functions $h_0$ and $h_1$. When placing a ball, we pick the bin with the fewest balls in it, breaking ties arbitrarily. If $h_0$ and $h_1$ are perfectly random hash functions the maximum load of any bin is known to be $\lg\lg n + O(1)$ whp. if $m = O(n)$ [1].

Definition 1.

Given hash functions $h_0, h_1$ (as above), let the hash graph denote the graph with bins as vertices and an edge between $h_0(x)$ and $h_1(x)$ for each key $x \in S$.

In this paper, we assume that $h_0$ and $h_1$ map to two disjoint tables, and the graph can thus be assumed to be bipartite. This is a standard assumption, see e.g. [15], and is actually preferable in the distributed setting. We note that the proofs can easily be changed such that they also hold when $h_0$ and $h_1$ map to the same table.
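The two-choice process with two disjoint tables can be sketched as follows. This is an illustrative simulation under stated assumptions rather than the experimental setup of the paper: the helper names `make_tabulation` and `two_choice_max_load`, the parameters ($c = 4$ characters of 8 bits, $n = 2^{10}$ bins per table), and the tie-breaking rule (prefer the first table) are all choices made here.

```python
import random


def make_tabulation(c, char_bits, r, rng):
    """Return a simple tabulation hash function into [2^r] (illustrative)."""
    mask = (1 << char_bits) - 1
    tables = [
        [rng.getrandbits(r) for _ in range(1 << char_bits)] for _ in range(c)
    ]

    def h(key):
        out = 0
        for i in range(c):
            out ^= tables[i][(key >> (i * char_bits)) & mask]
        return out

    return h


def two_choice_max_load(keys, h0, h1, n_bins):
    """Insert keys sequentially; each goes to the less loaded of its two bins."""
    loads = [[0] * n_bins, [0] * n_bins]  # two disjoint tables
    for x in keys:
        b0, b1 = h0(x), h1(x)
        # Ties are broken in favour of the first table (an arbitrary choice).
        if loads[0][b0] <= loads[1][b1]:
            loads[0][b0] += 1
        else:
            loads[1][b1] += 1
    return max(max(loads[0]), max(loads[1]))


rng = random.Random(2015)
r = 10
n = 1 << r
h0 = make_tabulation(4, 8, r, rng)
h1 = make_tabulation(4, 8, r, rng)
keys = rng.sample(range(1 << 32), n)
max_load = two_choice_max_load(keys, h0, h1, n)
print(max_load)
```

Passing structured key sets (for example, consecutive integers) instead of a random sample is a simple way to experiment with inputs on which simple tabulation behaves differently from fully random hashing.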

Definition 2.

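The hash-graph may be decomposed into a series of nested subgraphs $G_0 \subseteq G_1 \subseteq \cdots \subseteq G_m$, where, for keys $x_1, \ldots, x_m$ listed in insertion order, $G_i$ has edge-set $\{(h_0(x_j), h_1(x_j)) \mid j \le i\}$, which we will call the hash-graph at the time $i$. Similarly, the load of a vertex at the time $i$ is well-defined.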

Cuckoo hashing

Similar to the power of two choices, another scheme using two hash functions is cuckoo hashing [14]. In cuckoo hashing we wish to place each ball in one of two random locations without collisions. This is possible if and only if no component in the hash graph has more than one cycle. Pǎtraşcu and Thorup [15, Thm. 1.2] proved that with simple tabulation, this obstruction happens with probability $O(n^{-1/3})$, and we shall use this in our analysis of the expected maximum load for Theorem 2.
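The condition that no component has more than one cycle is equivalent to every connected component having at most as many edges as nodes. The sketch below checks this with a union-find structure; the function name `cuckoo_placeable` and the representation of a bin as a `(table, index)` pair are assumptions of this example.

```python
def cuckoo_placeable(edges):
    """True iff every component of the hash graph has #edges <= #nodes.

    edges: list of (u, v) pairs, one per key, where u and v are hashable
    bin identifiers such as (0, h0(x)) and (1, h1(x)).
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Merge the endpoints of every edge.
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    # Count nodes and edges per component.
    node_count, edge_count = {}, {}
    for v in list(parent):
        root = find(v)
        node_count[root] = node_count.get(root, 0) + 1
    for u, _ in edges:
        root = find(u)
        edge_count[root] = edge_count.get(root, 0) + 1

    return all(edge_count[root] <= node_count[root] for root in edge_count)


# A 4-cycle: four bins and four keys, so every key can get its own bin.
cycle = [((0, 0), (1, 0)), ((0, 0), (1, 1)), ((0, 1), (1, 0)), ((0, 1), (1, 1))]
print(cuckoo_placeable(cycle))
# One more key on the same four bins creates a second cycle.
print(cuckoo_placeable(cycle + [((0, 0), (1, 0))]))
```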

2.3 Graph terminology

The binomial tree $B_0$ of order $0$ is a single node. The binomial tree $B_k$ of order $k$ is a root node whose children are binomial trees of orders $0, \ldots, k-1$. A binomial tree of order $k$ has $2^k$ nodes and height $k$.
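The recursive definition can be checked directly. In this sketch a tree is represented as the list of its root's subtrees, which is a representation chosen for the example; the function names are likewise hypothetical.

```python
def binomial_tree(k):
    """Order-k binomial tree: the root's children have orders 0, ..., k-1."""
    return [binomial_tree(i) for i in range(k)]


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree)


def height(tree):
    """Number of edges on a longest root-to-leaf path."""
    return 0 if not tree else 1 + max(height(child) for child in tree)


for k in range(6):
    t = binomial_tree(k)
    print(k, size(t), height(t))
```

The node count follows from $1 + \sum_{i=0}^{k-1} 2^i = 2^k$, and the height from the child of order $k-1$.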

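The arboricity of a graph is the minimum number of spanning forests needed to cover all the edges of the graph. As shown by Nash-Williams [12, 11], the arboricity of a graph $G$ equals

$$\max \left\{ \left\lceil \frac{|E_s|}{|V_s| - 1} \right\rceil \;\middle|\; (V_s, E_s) \text{ is a subgraph of } G \text{ with } |V_s| \ge 2 \right\}.$$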
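For very small graphs, the Nash-Williams characterisation, namely the maximum of $\lceil |E_s| / (|V_s| - 1) \rceil$ over all subgraphs $(V_s, E_s)$ with at least two vertices, can be evaluated by brute force. The sketch below only enumerates induced subgraphs, which suffices because adding edges on a fixed vertex set can only increase the ratio. The function name `arboricity` is hypothetical, and the running time is exponential in the number of vertices, so this is meant purely as an illustration.

```python
from itertools import combinations


def arboricity(vertices, edges):
    """Brute-force Nash-Williams formula; exponential time, tiny graphs only."""
    vertices = list(vertices)
    best = 0
    for size in range(2, len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            # Edges of the subgraph induced by the chosen vertices.
            m = sum(1 for u, v in edges if u in chosen and v in chosen)
            best = max(best, -(-m // (size - 1)))  # ceiling division
    return best


path = [(0, 1), (1, 2)]
k4 = [(a, b) for a, b in combinations(range(4), 2)]
print(arboricity(range(3), path))  # a path is a single forest
print(arboricity(range(4), k4))    # K4 needs two forests
```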

3 Maximum load with high probability

This section is dedicated to proving Theorem 1. The main idea of the proof is to show that a hash graph resulting in high maximum load must have a component which is either large or has high arboricity. We then show that each of these cases is very unlikely.

As a negative result, we will first observe that we cannot prove that the maximum load is $\lg\lg n + O(1)$, or even $(1+o(1))\lg\lg n$, whp. when using simple tabulation.

Observation 3.1.

Given , there exists an ordered set consisting of keys, such that when they are distributed into bins using hash values from simple tabulation the max load is with probability .

Proof.

Consider now the set of keys consisting of keys. For each of the positions the probability that all the position characters on position hash to the same value is . So with probability this happens for all positions . This happens for both hash functions with probability . In this case is only dependent on . Order the keys lexicographically and insert them into the bins. If balls are distributed independently and uniformly at random to bins the maximum load would be with probability . (This can be proved along the lines of [1, Thm. 3.2].) If we had exactly copies of independent and random keys the maximum load would be at least times larger than if we had had independent and random keys. The latter is at least with probability .

Since there are copies of independent and uniformly random hash values we conclude that the maximum load is at least with probability under the assumption that is constant for any . Since the latter happens with probability the proof is finished.            

We will now show that a series of insertions inducing a hash graph with low arboricity and small components cannot cause too large a maximum load. Note that this is the case for any hash functions and not just simple tabulation.

Lemma 3.

Consider the process of placing some balls into bins with the two choice paradigm, and assume that some bin gets load . Then there exists a connected component in the corresponding hash graph with nodes and arboricity such that:

Proof.

Let be the node in the hash graph corresponding to the bin with load . Let and define , for , in the following way: For each bin of , add the edge corresponding to the st ball landing in to the set . Define to be the endpoints of the edges in (see fig. 1 for a visualization).

Figure 1: A visualisation of the sets .

It is clear, that each bin of must have a load of at least . Note that the definition implies that and . For each , let the ’th load-graph of denote the subgraph . Let be defined as the following lower bound on the arboricity of this subgraph:

Let , then is a lower bound on the arboricity of . Now note that for each :

Since for each this means that:

By an easy induction , and therefore . The connected component that contains contains at least nodes, has arboricity , and:

     

     

Our approach is now to 1) fix the hash graph, 2) observe that the hash graph must have a component with a certain structure (in this case high arboricity or big component), 3) find a large set of independent keys in this component, and 4) bound the probability that such a component could have occurred using .

In order to perform step 4 above we will need a way to bound the number of sets, , which have many dependent keys. This is captured by the following lemma, which is proved in Section 5.

Lemma 4.

Let be a subset with elements and fix and such that . The number of -tuples for which there exists distinct , which are dependent on is no more than:

where the constant in the -notation is dependent on .

The goal is now to use the following lemma several times.

Lemma 5.

Let with , and let be two independent simple tabulation hash functions. Fix some integer . If , then the maximum load of any bin when assigning keys using the two-choice paradigm is with probability .

Proof.

Fix the hash values of all the keys and consider the hash graph. Note that there is a one-to-one correspondence between the edges and the keys and we will not distinguish between the two in this proof. Consider any connected subgraph in the hash graph. We wish to argue that cannot be too big or have too high arboricity. In order to do this, we construct a set of independent edges contained in . Initially let for some edge in in . At all times we maintain the set of keys which are dependent on the keys in . Note that . The set is constructed iteratively in the following way: If there exists an edge that is incident to an edge in add to . Otherwise, if there exists an edge , which is incident to an edge in , add to . If neither type of edge exists we do not add more edges to . Note that in this case C = Y.

At any point we can partition the edges of into connected components , such that is the component of the initial edge of . For each we let be an edge incident to (such an edge must exist by the definition above). Order the components such that . For a visualisation of fig. 2 can be consulted. Intuitively, since the s cannot be chosen in too many ways, the “first node” of each cannot be chosen in too many ways, and thus it is not a problem that the set is not necessarily connected.

Figure 2: A visualization of the process. correspond to components contained in , and the red lines are the corresponding edges .

We stop the algorithm when either or . We will show that the probability that this can happen in the hash graph is bounded by . The two cases are described below and the proof of each case is ended with a .

The algorithm stops because : In this case we know that since the algorithm has not stopped earlier and only grows by one in each step. Fix the size and the number of components . We wish to bound the number of ways could have been chosen. First we bound the number of ways we can choose the subgraphs – i.e. the edges, nodes, and keys corresponding to edges. Let be the number of nodes in the subgraph . We can choose the structure of a spanning tree in each of in no more than ways. Let be the total number of nodes. Then this places of the edges and it remains to place edges, which can be done in at most ways. Similarly, the number of ways that the nodes can be chosen is at most by arguing in the following manner: For each component we can describe one node by referring to and which endpoint the node is at (these are the red nodes in Figure 2). Thus we can describe of the nodes in at most ways, where was the set before the addition of the last edge, so . These nodes can be picked in at most ways, since there are at most nodes in . The remaining nodes can be chosen in no more than ways. Assuming that is larger than a constant we know by lemma 4 that the number of ways to choose the keys in (including the order) is bounded by . Hence for a fixed the total number of ways to choose is at most:

For each of the independent keys we fix hash values, so the probability that those values occur is at most . Thus the total probability that we can find such for fixed values of is at most:

Since there are at most ways to choose we can bound the probability by a union bound and get .

The algorithm stops because : Let have the same meaning as before. The same line of argument (without using lemma 4) shows that the number of ways to choose is bounded by

Since we know that and a union bound over all choices of suffices.

Along the same lines we can show that with probability . Here, the idea is that we need to place additional keys when the spanning trees are fixed. Such a key and placement can be chosen in at most ways, but it happens with probability at most due to the independence of the keys.

Now, assume there exists a component with arboricity and choose a subgraph such that . Consider the algorithm constructing restricted to (and define , , and analogously). If the algorithm is not stopped early we know that contains the edges of , so and thus . This implies that , i.e. every component has arboricity with probability .

From the analysis above we get that there exists no component with more than nodes with probability . Combining this with lemma 3 we now conclude that with probability the maximum load is upper bounded by:

     

     

Proof of Theorem 1.

Divide the balls into portions of size , apply Lemma 5 to each portion, and take a union bound.            

4 Bounding the expected maximum load

This section is dedicated to proving Theorem 2. The main idea is to bound the probability that a big binomial tree appears in the hash graph. A crucial point of the proof is to consider a subtree of the binomial tree which is chosen such that the number of leaves is much larger than the number of internal nodes.

First of all note that by theorem 1, the probability that the maximum load is more than is for some constant . Hence it suffices to prove that the probability that the maximum load is larger than is at most for some constant depending on and .

Observation 4.1.

If there exists a bin with load at least then either there is a component with more edges than nodes, or the binomial tree is a subgraph of the hash graph.

Proof.

Assume no component has more edges than nodes. Then, removing at most one edge from each component yields a forest. One edge per component will at most increase the load by , so consider the remaining forest.

Consider now the order in which the keys are inserted, and use induction on this order. Define to be the graph after the th key is inserted. The induction hypothesis is that if a bin has load , then it is the root in a subtree which is . For it is easy to see. Consider now the addition of the th key and assume that the hypothesis holds. Assume that the added key corresponds to the edge and that the load of bin increases to . Since there are no cycles, node must have edges , and by the induction hypothesis are roots of disjoint binomial trees , so is the root of a .            

Let double cycle denote any of the minimal obstructions described in [15], that is, either a cycle and a path between two vertices of the cycle, or two cycles linked by a path. Note that any connected graph with two cycles (not necessarily disjoint) contains a double cycle as a subgraph.

Observation 4.2.

If there exists a bin with load at least , then either the binomial tree is a subgraph of the hash graph, or a double cycle with at most edges is a subgraph of the hash graph.

Proof.

As in the proof of Lemma 3, let be the node with load , and let be the st load-graph of .

If the st load-graph of has no more edges than vertices, it must contain as a subgraph, as noted in Observation 4.1. Otherwise, take as root and consider a breadth first spanning tree, . It must have height at most , and there must be two edges of the combined load-graph not in . Furthermore, these edges cannot have both endpoints have maximal distance from . Thus, the union has at most edges and must contain a double cycle as a subgraph.

We are now ready to prove Theorem 2.

Proof of Theorem 2.

if we know from [15, Thm. 1.2] that no component of the hash graph contains a double cycle with probability . Looking into the proof we see that there exists no double cycle consisting of at most edges with probability even when . In the terminology of [15], bits per edge is saved in the encoding of the hash-values. But when we add bits to the encoding instead. If the double cycle consists of edges this is extra bits in the encoding, i.e. the bound on the probability is multiplied with . This means that we only need to bound the probability that there exists a binomial tree , because, according to Observation 4.2, any bin with load will either imply the existence of in the hash graph or the existence of a double cycle consisting of edges, and the latter happens with probability .

Say that the hash graph contains a binomial tree . Consider the subtree defined by removing the children of all nodes that have less than children, where is some constant to be defined (see fig. 3). Note that has edges. We will now follow the same approach as in the proof of Theorem 1, in the sense that we construct a set of independent keys from , and show that this is unlikely. We construct the ordered set by traversing in the following way: Order the edges in increasing distance from the root and on each level from left to right. Traverse the edges in this order. A given edge is added to the ordered set if the following two requirements are fulfilled:

  • After the edge is added corresponds to a connected subgraph of .

  • The key corresponding to the edge is independent of all the keys corresponding to the edges in .

A visualization of the set can be seen in fig. 3.

Figure 3: Example of and the corresponding set . The dashed edges correspond to key dependencies at the time the edge is considered in the order. This example would correspond to case 2.

We will think of as a set of edges, but also as a set of independent keys. The idea is to bound the probability that we could find such a set . We will split the proof into four cases depending on , and each will end with a .

Case 1: : In this case every edge of the tree is independent, and there are at most different ways to choose the ordered set . Note that there are groups of leaves which have the same parent. The set corresponds to the same subgraph of the hash graph regardless of the ordering of these leaves. Since we only want to bound the probability that we can find such , we can thus choose the edges of in at most ways. For a given choice of there are equations which must be fulfilled where and are keys in . Since the keys in are independent, the probability that this happens for a given is at most . By a union bound on all the choices of the probability that such an exists is at most:

We now pick and such that and . It then follows that and the probability is bounded by .

For case 2 and case 3, we will use the following lemma, which is proved in Section 5.

Lemma 6.

Let be a subset with elements and fix such that . The number of -tuples for which there exists such that is dependent on is no more than:

Case 2: All the edges incident to the root lie in : Let be defined in a similar manner as : Order the edges in increasing distance from the root and on each level from left to right as before. Traverse the edges in this order, and add the edges to if the corresponding key is independent of the keys in . However, stop this traversal the first time a dependent key occurs. In this way will be an ordered subset of and the tree-structure will only depend on . Fix this value . Since there is a key which is dependent on the keys in there are at most ways to choose by Lemma 6 assuming that , i.e. assuming that is larger than some constant depending on .

Every internal node of has exactly children that are leaves. Therefore, there can be at most one node in having less than children that are leaves and belong to . Let denote the internal nodes in , where is the number of internal nodes. Let denote the number of children of that are leaves. Similar to case 1, the structure of is independent of the order of the leaves with the same parent. Therefore can be chosen in at most ways. Since we see that:

Letting the concavity of combined with Jensen’s inequality yields:

At most one of the ’s can be smaller than , so wlog. assume that . The total number of nodes must be at least , i.e. giving . Since we see that:

Where the last inequality holds assuming that (and hence ) is larger than a constant. Since we see that:

Assume that is chosen such that . The number of cases that we need to consider is then at most:

Since is a tree there are equalities of the form where that must be satisfied if occurs. Since we know the tree structure from knowing there are at most two ways to choose these equalities. This means that the probability that a specific occurs is bounded by . For a fixed the probability that there exists with elements is therefore bounded by:

A union bound over all now yields the desired upper bound:

Case 3: Not all, but at least edges incident to the root lie in : Let be the set of independent keys adjacent to the root, and set . By Lemma 6, can be chosen in no more than ways since there must exist a key (corresponding to an edge incident to the root) which is dependent on the keys in and the order of the keys is irrelevant. Since all the keys in are independent, the probability that or are the same for all the keys is at most . So the probability that such a can be found is at most:

For case 4, we will use the following generalization of Lemma 6, which is proved in Section 5.

Lemma 7.

Let with and fix such that . The number of -tuples for which there exists distinct for such that each is dependent on is at most

Case 4: There are fewer than edges incident to the root in : Let be the set of keys corresponding to the edges from incident to the root and let . Since the other keys incident to the root must be dependent on the keys from , Lemma 7 states that can be chosen in at most ways. Since all the keys in are independent the probability that or are the same for all the keys is at most . Thus, the probability of such a set occurring is bounded by:

This covers all cases for the set .            

Consider the case of distributing balls into bins. Note that the proof actually gives an expected maximum load of if . However, this only matches the behaviour of truly random hash functions under the assumption that .

The same techniques can be used to show that -independent hash functions yield a maximum load of with high probability (this is essentially case 1 in the proof). This implies that -independence hashing is sufficient to give the same theoretical guarantees as truly random hash functions in the context of the power of two choices when .

5 Proofs of structural lemmas

In this section we prove the lemmas used in Sections 3 and 4.

The following lemma is a generalization of Lemma 2 and is proved in [5].

Lemma 8 ([5]).

Let be subsets of . The number of -tuples such that

(2)

is at most . (Where )

We use Lemma 8 in our proof of Lemma 7:

Proof of Lemma 7.

For each let