Linear Hashing is Awesome

Mathias Bæk Tejs Knudsen
University of Copenhagen, mathias@tejs.dk

Research partly supported by Advanced Grant DFF-0602-02499B from the Danish Council for Independent Research under the Sapere Aude research career programme and by the FNU project AlgoDisc - Discrete Mathematics, Algorithms, and Data Structures.

Abstract

We consider the hash function where are chosen uniformly at random from . We prove that when we use in hashing with chaining to insert elements into a table of size the expected length of the longest chain is . The proof also generalises to give the same bound when we use the multiply-shift hash function by Dietzfelbinger et al. [Journal of Algorithms 1997].

1 Introduction

In this paper we study the hash function (where ) defined by , where are chosen uniformly at random from . Here, is a prime and . We assume that we have a set of keys with and use to assign a hash value to each key . We are interested in the frequency of the most popular hash value, i.e. we study the random variable defined by

\[
M(h,X) = \max_{y \in [m]} \bigl|\{x \in X \mid h(x) = y\}\bigr|. \tag{1}
\]

In Theorem 1 we prove that . We also consider the hash function defined by , where are powers of , and is chosen uniformly at random among the odd numbers from . The function was first introduced by Dietzfelbinger et al. [3]. In Theorem 2 we prove that it also holds that .
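As an illustration of the second hash function, here is a minimal sketch of multiply-shift in its commonly stated form, x ↦ ⌊(ax mod 2^w) / 2^(w−l)⌋ with a random odd multiplier a. The parameter names `w` and `l` and the helper `make_multiply_shift` are assumptions made for this example, and the exact parametrisation used in [3] may differ.

```python
import random


def make_multiply_shift(w: int, l: int, rng: random.Random):
    """Return a multiply-shift hash from w-bit keys to l-bit values.

    Assumed form: h(x) = floor((a * x mod 2^w) / 2^(w - l)),
    with a drawn uniformly from the odd numbers in {1, 3, ..., 2^w - 1}.
    """
    assert 0 < l <= w
    a = rng.randrange(1, 1 << w, 2)  # uniformly random odd multiplier
    mask = (1 << w) - 1              # reduction mod 2^w

    def h(x: int) -> int:
        # Keep the low w bits of a*x, then take the top l of those bits.
        return ((a * x) & mask) >> (w - l)

    return h


if __name__ == "__main__":
    h = make_multiply_shift(w=32, l=10, rng=random.Random())
    print([h(x) for x in range(5)])
```

Because both the universe size and the table size are powers of two, the function needs only one multiplication, a mask and a shift.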

We note that when we use in hashing with chaining, is the size of the largest chain. When scanning the hash table for an element the expected time used is and the worst case time is at most .
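The quantity in (1) is easy to measure empirically. The following is a minimal sketch, assuming h has the standard linear form h(x) = ((ax + b) mod p) mod m with a and b drawn uniformly from {0, ..., p−1}. The helper names `make_linear_hash` and `max_load` are hypothetical and chosen only for this example.

```python
import random
from collections import Counter


def make_linear_hash(p: int, m: int, rng: random.Random):
    """Assumed form: h(x) = ((a*x + b) mod p) mod m, with a, b uniform in [p]."""
    a = rng.randrange(p)
    b = rng.randrange(p)
    return lambda x: ((a * x + b) % p) % m


def max_load(h, keys) -> int:
    """M(h, X): the number of keys landing on the most popular hash value."""
    return max(Counter(h(x) for x in keys).values())


if __name__ == "__main__":
    rng = random.Random()
    p = 2**31 - 1                      # a Mersenne prime
    n = m = 1024                       # n keys, table of size m
    keys = rng.sample(range(p), n)     # n distinct keys from [p]
    h = make_linear_hash(p, m, rng)
    print("longest chain:", max_load(h, keys))
```

With hashing with chaining, `max_load` is exactly the length of the longest chain, so repeating the experiment over many random choices of `a` and `b` gives an empirical estimate of the expectation studied here.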

1.1 Related work

It is folklore that the size of the largest chain is and this bound holds for any -independent hash function.

Alon et al. [1] consider the linear hash function , where is a finite field and . The function is defined by , where is chosen uniformly at random. For the hash function is where are chosen uniformly at random. It is shown in [1] that there exists a set such that if is a square and if is a prime power that is not a square. In [1] it is also shown that when is the field of two elements the expected length of the longest chain is , improving the results in [4, 5].

Broder et al. [2] considered in the context of min-wise hashing.

2 Preliminaries

denotes the integers, and denotes the integers . is the set of elements of having a multiplicative inverse. is the set of integers from to , that is . For a pair of integers such that we let denote the greatest common divisor of and . If then and are said to be coprime.

For integers we let denote the residue class of . We let be the unique mapping that satisfies . For we let .

For a set and an element the sets and are defined as and , respectively.

For integers , let denote the set

\[
I_m(r,s) = \{[r]_m, [r+1]_m, \ldots, [r+s-1]_m\}.
\]

The set is called an interval. A non-empty set is an interval if there exists such that .
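The interval definition can be made concrete with a short sketch. Residue classes modulo m are represented here by integers in {0, ..., m−1}, and the function names `interval` and `is_interval` are assumptions made for this example.

```python
def interval(m: int, r: int, s: int) -> set[int]:
    """I_m(r, s) = {[r]_m, [r+1]_m, ..., [r+s-1]_m} as a set of residues."""
    return {(r + i) % m for i in range(s)}


def is_interval(subset: set[int], m: int) -> bool:
    """True if the non-empty set equals I_m(r, s) for some r, with s = |subset|."""
    residues = {x % m for x in subset}
    s = len(residues)
    # An interval must start at one of its own elements, so try each as r.
    return s > 0 and any(interval(m, r, s) == residues for r in residues)


if __name__ == "__main__":
    print(interval(7, 5, 4))          # wraps around modulo 7
    print(is_interval({6, 0, 1}, 7))  # contiguous modulo 7
    print(is_interval({0, 2}, 7))     # has a gap
```

Intervals may wrap around modulo m, which is why the check tries every element of the set as a possible starting point.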

3 Main Result

In this section we prove the main results of this paper, namely Theorem 1 and Theorem 2. The proofs of the two theorems are very similar and both rely on Lemma 1 below.

Lemma 1.

Let be integers satisfying and let be a set of size . Let be a set of size satisfying the following conditions:

• for all .

• , and are pairwise coprime for every with .

Assume that for every there exists an interval of size such that contains at least elements. Then there exist at least ordered pairs of different elements such that .

Proof.

We note that for every the set is the union of disjoint intervals of size , and we write it as such a union . For any the set is either empty or an interval. So for each there is at most one index such that the intersection is non-empty. For every and , let denote the number of elements such that is non-empty. Note that since . Furthermore since each has a non-empty intersection with at most one of the sets .

The number of ordered pairs of different elements such that is exactly since is an interval of size and . Let , then the number of pairs is at least . We can lower bound the number of such pairs in by considering the pairs in for each and , noting that each pair we count is counted at most times. This gives that the number of ordered pairs such that is at least:

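\[
\sum_{b \in B} \sum_{j \in [\iota(b)]} \frac{(\tau(b,j))^2}{\delta(b,j)} \tag{2}
\]

For any , by the Cauchy-Schwarz inequality we have that:

\[
\Biggl(\sum_{j \in [\iota(b)]} \delta(b,j)\Biggr) \Biggl(\sum_{j \in [\iota(b)]} \frac{(\tau(b,j))^2}{\delta(b,j)}\Biggr) \ge \Biggl(\sum_{j \in [\iota(b)]} \tau(b,j)\Biggr)^2 \tag{3}
\]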

We clearly have that . Also recall that . Combining this with (2) and (3) gives that contains at least of the desired pairs. ∎
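For completeness, inequality (3) is the standard Cauchy-Schwarz inequality applied to two particular vectors. The following sketch fixes b, abbreviates τ(b,j) and δ(b,j) as τ_j and δ_j (a shorthand introduced only here), and assumes every δ_j is positive:

```latex
\[
\Bigl(\sum_{j} \tau_j\Bigr)^{2}
  = \Bigl(\sum_{j} \sqrt{\delta_j}\cdot\frac{\tau_j}{\sqrt{\delta_j}}\Bigr)^{2}
  \le \Bigl(\sum_{j} \bigl(\sqrt{\delta_j}\bigr)^{2}\Bigr)
      \Bigl(\sum_{j} \Bigl(\frac{\tau_j}{\sqrt{\delta_j}}\Bigr)^{2}\Bigr)
  = \Bigl(\sum_{j} \delta_j\Bigr)\Bigl(\sum_{j} \frac{\tau_j^{2}}{\delta_j}\Bigr).
\]
```

Dividing by the sum of the δ_j turns this into a lower bound on the inner sum of (2).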

We now state and prove Theorem 1.

Theorem 1.

Let be integers with a prime and . Let be a set of elements. Let be defined by where are chosen uniformly at random. Let be the random variable counting the number of elements that hash to the most popular hash value, that is

\[
M = M(X) = \max_{y \in \mathbb{Z}_m} \bigl|\{x \in X \mid h(x) = y\}\bigr|.
\]

Then

\[
\mathrm{E}[M] = O\bigl(\sqrt[3]{n \log n}\bigr). \tag{4}
\]

Proof.

We note that since is constant when . Therefore:

\[
\mathrm{E}[M] = \frac{p-1}{p}\,\mathrm{E}[M \mid a \neq 0] + \frac{1}{p}\,\mathrm{E}[M \mid a = 0]
\]

Therefore it suffices to bound the expected value of when is chosen uniformly at random from and not from . So from now on, assume that is chosen uniformly at random from .

The random variables and are independent. Note that can be rewritten as . It clearly suffices to bound the expected value of conditioned on all possible values . For any fixed value of , the expected value of conditioned on is the same as the expected value of conditioned on . Therefore it suffices to give the proof under the assumption that . So we assume that .

Let , then there exists an interval of size at most that contains elements of for the following reason: Let be defined by . By definition, there exists a random variable such that . And there exists a such that

\[
f^{-1}(y) = \{\, [i + km]_p \mid k \in \mathbb{Z},\ 0 \le k < \dots \,\}
\]

and hence is an interval of size that contains elements of .

Let . We are now going to bound the probability that . Let and let be the set of all elements such that if .

Let be the set of all elements that satisfy that is a prime in the interval . Let be the set of all elements such that . Note that is a random variable. By linearity of expectation, we have that . Recall that . For any we have that and therefore there exists an interval of size that contains at least elements of . By Lemma 1, this implies that there are ordered pairs of different elements such that . So the expected number of elements such that is at least . On the other hand, for each ordered pair of different elements the probability that is at most , and by linearity of expectation the expected number of such ordered pairs is at most

\[
n(n-1) \cdot \frac{2p}{m\alpha(p-1)} \le \frac{2n}{\alpha}.
\]

We conclude that . By the prime number theorem, . Reordering gives us that:

\[
\Pr[M \ge 4\alpha] = \delta = O\!\left(\frac{n \log n}{\alpha^{3}}\right).
\]

The expected value of can now be bounded in the following manner:

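\[
\begin{aligned}
\mathrm{E}[M] &= \sum_{k=1}^{\infty} \Pr[M \ge k] \\
&= \sum_{k=1}^{\lfloor \sqrt[3]{n \log n} \rfloor} \Pr[M \ge k] + \sum_{k=\lfloor \sqrt[3]{n \log n} \rfloor + 1}^{n} \Pr[M \ge k] \\
&= O\bigl(\sqrt[3]{n \log n}\bigr)
\end{aligned}
\]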

which was what we wanted. ∎
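The last equality in the bound on E[M] can be spelled out as follows. This is a sketch under the assumption that the tail bound holds in the form Pr[M ≥ k] ≤ C n log n / k³ for every k > K, where K = ⌊∛(n log n)⌋ ≥ 1 and C is a hypothetical constant absorbing the factor from replacing 4α by k:

```latex
\[
\sum_{k=1}^{K} \Pr[M \ge k] \le K,
\qquad
\sum_{k=K+1}^{n} \Pr[M \ge k]
  \le \sum_{k=K+1}^{\infty} \frac{C\, n \log n}{k^{3}}
  \le C\, n \log n \int_{K}^{\infty} \frac{dx}{x^{3}}
  = \frac{C\, n \log n}{2K^{2}}.
\]
```

Since K² = Θ((n log n)^{2/3}), the second sum is O(∛(n log n)), and the first is at most K, so both parts are O(∛(n log n)).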

The proof of Theorem 2 is very similar to the proof of Theorem 1 but we include it for completeness.

Theorem 2.

Let be integers with and . Let be a set of elements. Let be defined by where are chosen uniformly at random. Let . Then

\[
\mathrm{E}[M] = O\bigl(\sqrt[3]{n \log n}\bigr). \tag{5}
\]

Proof.

Let be a random variable such that , and let . The set is an interval of size that contains exactly elements of .

Let . We are now going to bound the probability that . Let , and let be the set of all elements such that if .

Let be the set of all elements that satisfy that is a prime in the interval . Let be the set of all elements such that . Note that is a random variable. By linearity of expectation, we have that . Recall that . For any we have that and therefore there exists an interval of size that contains at least elements of . By Lemma 1, this implies that there are ordered pairs of different elements such that . So the expected number of elements such that is at least . On the other hand, for each ordered pair of different elements the probability that is at most , and by linearity of expectation the expected number of such ordered pairs is at most

\[
n(n-1) \cdot \frac{4}{m\alpha} \le \frac{4n}{\alpha}.
\]

We conclude that , and now we can bound the expected value exactly as in Theorem 1. ∎

References

• [1] Noga Alon, Martin Dietzfelbinger, Peter Bro Miltersen, Erez Petrank, and Gábor Tardos. Linear hash functions. Journal of the ACM (JACM), 46(5):667–683, 1999.
• [2] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630–659, 2000.
• [3] Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. A reliable randomized algorithm for the closest-pair problem. Journal of Algorithms, 25(1):19–51, 1997.
• [4] George Markowsky, Larry Carter, and Mark N. Wegman. Analysis of a universal class of hash functions. In Mathematical Foundations of Computer Science 1978, Proceedings, 7th Symposium, Zakopane, Poland, September 4-8, 1978, pages 345–354, 1978.
• [5] Kurt Mehlhorn and Uzi Vishkin. Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memories. Acta Inf., 21:339–374, 1984.