Faster 64-bit universal hashing using carryless multiplications
Abstract
Intel and AMD support the Carryless Multiplication (CLMUL) instruction set in their x64 processors. We use CLMUL to implement an almost universal 64-bit hash family (CLHASH). We compare this new family with what might be the fastest almost universal family on x64 processors (VHASH). We find that CLHASH is at least 60% faster. We also compare CLHASH with a popular hash function designed for speed (Google’s CityHash). We find that CLHASH is 40% faster than CityHash on inputs larger than 64 bytes and just as fast otherwise.
Keywords:
Universal hashing, Carryless multiplication, Finite field arithmetic
1 Introduction
Hashing is the fundamental operation of mapping data objects to fixed-size hash values. For example, all objects in the Java programming language can be hashed to 32-bit integers. Many algorithms and data structures rely on hashing: e.g., authentication codes, Bloom filters and hash tables. We typically assume that given two data objects, the probability that they have the same hash value (called a collision) is low. When this assumption fails, adversaries can negatively impact the performance of these data structures or even create denial-of-service attacks. To mitigate such problems, we can pick hash functions at random (henceforth called random hashing).
Random hashing is standard in Ruby, Python and Perl. It is allowed explicitly in Java and C++11.
There are many fast random hash families — e.g., MurmurHash, Google’s CityHash (35), SipHash (3) and VHASH (12).
Cryptographers have also designed fast hash families with strong theoretical guarantees (6); (18); (24). However, much of this work predates the introduction of the CLMUL instruction set in commodity x86 processors. Intel and AMD added CLMUL and its pclmulqdq instruction to their processors to accelerate some common cryptographic operations. Although the pclmulqdq instruction first became available in 2010, its high cost in terms of CPU cycles (specifically an 8-cycle throughput on pre-Haswell Intel microarchitectures and a 7-cycle throughput on pre-Jaguar AMD microarchitectures) limited its usefulness outside of cryptography. However, the throughput of the instruction on the newer Haswell architecture is down to 2 cycles, even though it remains a high-latency operation (7 cycles) (16); (21).
intrinsic  instruction  description  latency (cycles)  reciprocal throughput (cycles)

_mm_clmulepi64_si128  pclmulqdq  64-bit carryless multiplication  7  2
_mm_or_si128  por  bitwise OR  1  0.33
_mm_xor_si128  pxor  bitwise XOR  1  0.33
_mm_slli_epi64  psllq  shift left two 64-bit integers  1  1
_mm_srli_si128  psrldq  shift right by bytes  1  0.5
_mm_shuffle_epi8  pshufb  shuffle 16 bytes  1  0.5
_mm_cvtsi64_si128  movq  64-bit integer as 128-bit reg.  1  –
_mm_cvtsi128_si64  movq  64-bit integer from 128-bit reg.  2  –
_mm_load_si128  movdqa  load a 128-bit reg. from memory (aligned)  1  0.5
_mm_lddqu_si128  lddqu  load a 128-bit reg. from memory (unaligned)  1  0.5
_mm_setr_epi8  –  construct 128-bit reg. from 16 bytes  –  –
_mm_set_epi64x  –  construct 128-bit reg. from two 64-bit integers  –  –
2 Random Hashing
In random hashing, we pick a hash function at random from some family, whereas an adversary might pick the data inputs. We want distinct objects to be unlikely to hash to the same value. That is, we want a low collision probability.
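We consider hash functions from $X$ to $[0, 2^L)$. An $L$-bit family is universal (10); (11) if the probability of a collision is no more than $2^{-L}$. That is, it is universal if
$$P(h(x) = h(x')) \le 2^{-L}$$
for any fixed $x, x' \in X$ such that $x \ne x'$, given that we pick $h$ at random from the family. It is $\epsilon$-almost universal (36) (also written $\epsilon$-AU) if the probability of a collision is bounded by $\epsilon$. I.e., $P(h(x) = h(x')) \le \epsilon$, for any $x, x' \in X$ such that $x \ne x'$. (See Table 2.)
$L$-bit hash function $h$  definition

universal  $P(h(x) = h(x')) \le 2^{-L}$ for $x \ne x'$
$\epsilon$-almost universal  $P(h(x) = h(x')) \le \epsilon$ for $x \ne x'$
XOR-universal  $P(h(x) = h(x') \oplus c) \le 2^{-L}$ for any $c \in [0, 2^L)$ and distinct $x, x'$
$\epsilon$-almost XOR-universal  $P(h(x) = h(x') \oplus c) \le \epsilon$ for any integer $c \in [0, 2^L)$ and distinct $x, x'$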
2.1 Safely Reducing Hash Values
Almost universality can be insufficient to prevent frequent collisions since a given algorithm might only use the first few bits of the hash values. Consider hash tables. A hash table might use as a key only the first $b$ bits of the hash values when its capacity is $2^b$. Yet even if a hash family is almost universal, it could still have a high collision probability on the first few bits.
For example, take any 32-bit universal family $\mathcal{H}$, and derive the new $1/2^{32}$-almost universal 64-bit family by taking the functions from $\mathcal{H}$ and multiplying them by $2^{32}$: $h'(x) = h(x) \times 2^{32}$. Clearly, all functions from this new family collide with probability 1 on the first 32 bits, even though the collision probability on the full hash values is low ($1/2^{32}$). Using the first bits of these hash functions could have disastrous consequences in the implementation of a hash table.
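The failure mode described above can be reproduced in a few lines. The following is a minimal Python sketch, assuming that "first bits" means the least significant bits; the 32-bit function `h32` is a hypothetical stand-in (multiplication by a random odd constant), chosen only so that the example runs, and is not claimed to be universal.

```python
import random

def make_weak_family_member(seed):
    """Build h'(x) = h(x) * 2**32 from a 32-bit hash h (illustrative only)."""
    # Hypothetical 32-bit stand-in: multiply by a random odd constant.
    a = random.Random(seed).randrange(1, 2**32, 2)

    def h32(x):
        return (a * x) % 2**32

    def h64(x):
        # Shifting left by 32 bits is the multiplication by 2**32.
        return h32(x) << 32

    return h64

h = make_weak_family_member(seed=42)
values = [h(x) for x in range(1000)]

# The full 64-bit values are all distinct ...
distinct_full = len(set(values))
# ... but the 32 least significant bits are always zero.
low_bits = {v % 2**32 for v in values}
```

A hash table that indexes its buckets with `v % 2**b` for any `b <= 32` would place all 1000 keys in bucket 0.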
Therefore, we consider stronger forms of universality.

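A family is $\epsilon$-almost XOR-universal if
$$P(h(x) = h(x') \oplus c) \le \epsilon$$
for any integer constant $c \in [0, 2^L)$ and any $x, x' \in X$ such that $x \ne x'$ (where $\oplus$ is the bitwise XOR). A family that is $1/2^L$-almost XOR-universal is said to be XOR-universal (37).
Given an $\epsilon$-almost universal family $\mathcal{H}$ of hash functions $h : X \to [0, 2^L)$, the family of hash functions
$$\{ h(x) \bmod 2^{L'} \mid h \in \mathcal{H} \}$$
from $X$ to $[0, 2^{L'})$ is $2^{L-L'} \times \epsilon$-almost universal (12). The next lemma shows that a similar result applies to almost XOR-universal families.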
Lemma 1
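Given an $\epsilon$-almost XOR-universal family $\mathcal{H}$ of hash functions $h : X \to [0, 2^L)$ and any positive integer $L' < L$, the family of hash functions $\{ h(x) \bmod 2^{L'} \mid h \in \mathcal{H} \}$ from $X$ to $[0, 2^{L'})$ is $2^{L-L'} \times \epsilon$-almost XOR-universal.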
Proof
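Pick any positive integer $L' < L$. For any integer constant $c \in [0, 2^{L'})$, consider the equation $h(x) \bmod 2^{L'} = (h(x') \bmod 2^{L'}) \oplus c$ for $x \ne x'$ with $h$ picked from $\mathcal{H}$. We have
$$P\big(h(x) \bmod 2^{L'} = (h(x') \bmod 2^{L'}) \oplus c\big) = \sum_{z \,:\, z \bmod 2^{L'} = 0} P(h(x) = h(x') \oplus c \oplus z)$$
where the sum is over $2^{L-L'}$ distinct $z$ values. Because $\mathcal{H}$ is $\epsilon$-almost XOR-universal, we have that $P(h(x) = h(x') \oplus c \oplus z) \le \epsilon$ for any $c$ and any $z$. Thus, we have that $P\big(h(x) \bmod 2^{L'} = (h(x') \bmod 2^{L'}) \oplus c\big) \le 2^{L-L'} \epsilon$, showing the result.
It follows from Lemma 1 that if a family is XOR-universal, then its modular reductions are XOR-universal as well.
As a straightforward extension of this lemma, we could show that when picking any $L'$ bits (not only the least significant), the result is $2^{L-L'} \times \epsilon$-almost XOR-universal.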
2.2 Composition
It can be useful to combine different hash families to create new ones. For example, it is common to compose hash families. When composing hash functions ($h = g \circ f$), the universality degrades linearly: if $g$ is picked from an $\epsilon_g$-almost universal family and $f$ is picked (independently) from an $\epsilon_f$-almost universal family, the result is $(\epsilon_g + \epsilon_f)$-almost universal (36).
We sketch the proof. For $x \ne x'$, we have that $g(f(x))$ collides with $g(f(x'))$ if $f(x) = f(x')$. This occurs with probability at most $\epsilon_f$ since $f$ is picked from an $\epsilon_f$-almost universal family. If not, they collide if $g(y) = g(y')$ where $y = f(x)$ and $y' = f(x')$, with probability bounded by $\epsilon_g$. Thus, we have bounded the collision probability by $\epsilon_f + (1 - \epsilon_f)\epsilon_g \le \epsilon_f + \epsilon_g$, establishing the result.
By extension, we can show that if $g$ is picked from an $\epsilon_g$-almost XOR-universal family, then the composed result ($h = g \circ f$) is going to be $(\epsilon_g + \epsilon_f)$-almost XOR-universal. It is not required for $f$ to be almost XOR-universal.
2.3 Hashing Tuples
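If we have universal hash functions from $X$ to $[0, 2^L)$, then we can construct hash functions from $X^m$ to $[0, 2^L)^m$ while preserving universality. The construction is straightforward: $h'(x_1, x_2, \ldots, x_m) = (h(x_1), h(x_2), \ldots, h(x_m))$. If $h$ is picked from an $\epsilon$-almost universal family, then the result is $\epsilon$-almost universal. This is true even though a single $h$ is picked and reused $m$ times.
Lemma 2
Consider an $\epsilon$-almost universal family $\mathcal{H}$ from $X$ to $[0, 2^L)$. Then consider the family of functions $\mathcal{H}'$ of the form $h'(x_1, x_2, \ldots, x_m) = (h(x_1), h(x_2), \ldots, h(x_m))$ from $X^m$ to $[0, 2^L)^m$, where $h$ is in $\mathcal{H}$. Family $\mathcal{H}'$ is $\epsilon$-almost universal.
The proof is not difficult. Consider two distinct values from $X^m$, $x_1, x_2, \ldots, x_m$ and $x'_1, x'_2, \ldots, x'_m$. Because the tuples are distinct, they must differ in at least one component: $x_i \ne x'_i$. It follows that $h'(x_1, x_2, \ldots, x_m)$ and $h'(x'_1, x'_2, \ldots, x'_m)$ collide with probability at most $P(h(x_i) = h(x'_i)) \le \epsilon$, showing the result.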
2.4 Variable-Length Hashing From Fixed-Length Hashing
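Suppose that we are given a family $\mathcal{H}$ of hash functions that is XOR universal over fixed-length strings. That is, we have that $P(h(s) = h(s') \oplus c) \le 1/2^L$ if the length of $s$ is the same as the length of $s'$ ($|s| = |s'|$). We can create a new family that is XOR universal over variable-length strings by introducing a hash family on string lengths. Let $\mathcal{G}$ be a family of XOR universal hash functions $g$ over length values. Consider the new family of hash functions of the form $h(s) \oplus g(|s|)$ where $h \in \mathcal{H}$ and $g \in \mathcal{G}$. Let us consider two distinct strings $s$ and $s'$. There are two cases to consider.

If $s$ and $s'$ have the same length so that $g(|s|) = g(|s'|)$ then we have XOR universality since
$$P(h(s) \oplus g(|s|) = h(s') \oplus g(|s'|) \oplus c) = P(h(s) = h(s') \oplus c) \le 1/2^L$$
where the last inequality follows because $h \in \mathcal{H}$, an XOR universal family over fixed-length strings.

If the strings have different lengths ($|s| \ne |s'|$), then we again have XOR universality because
$$P(h(s) \oplus g(|s|) = h(s') \oplus g(|s'|) \oplus c) = P(g(|s|) = g(|s'|) \oplus (c \oplus h(s) \oplus h(s'))) = P(g(|s|) = g(|s'|) \oplus c') \le 1/2^L$$
where we set $c' = c \oplus h(s) \oplus h(s')$, a value independent from $|s|$ and $|s'|$. The last inequality follows because $g$ is taken from a family $\mathcal{G}$ that is XOR universal.
Thus the result ($h(s) \oplus g(|s|)$) is XOR universal. We can also generalize the analysis. Indeed, if $\mathcal{H}$ and $\mathcal{G}$ are $\epsilon$-almost universal, we could show that the result is $\epsilon$-almost universal. We have the following lemma.
Lemma 3
Let $\mathcal{H}$ be an XOR universal family of hash functions over fixed-length strings. Let $\mathcal{G}$ be an XOR universal family of hash functions over integer values. We have that the family of hash functions of the form $h(s) \oplus g(|s|)$ where $h \in \mathcal{H}$ and $g \in \mathcal{G}$ is XOR universal over all strings.
Moreover, if $\mathcal{H}$ and $\mathcal{G}$ are merely $\epsilon$-almost universal, then the family of hash functions of the form $h(s) \oplus g(|s|)$ is also $\epsilon$-almost universal.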
2.5 Minimally Randomized Hashing
Many hashing algorithms (for instance, CityHash (35)) rely on a small random seed. The 64-bit version of CityHash takes a 64-bit integer as a seed. Thus, we effectively have a family of $2^{64}$ hash functions, one for each possible seed value.
Given such a small family (i.e., given few random bits), we can prove that it must have high collision probabilities. Indeed, consider the set of all strings of $m$ 64-bit words. There are $2^{64m}$ such strings.

Pick one hash function from the CityHash family. This function hashes every one of the $2^{64m}$ strings to one of $2^{64}$ hash values. By a pigeonhole argument (31), there must be at least one hash value where at least $2^{64m}/2^{64} = 2^{64(m-1)}$ strings collide.

Pick another hash function. Out of the $2^{64(m-1)}$ strings colliding when using the first hash function, we must have $2^{64(m-2)}$ strings also colliding when using the second hash function.
We can repeat this process $m-1$ times until we find $2^{64}$ strings colliding when using any of these $m-1$ hash functions. If an adversary picks any two of our $2^{64}$ strings and we pick the hash function at random in the whole family of $2^{64}$ hash functions, we get a collision with a probability of at least $(m-1)/2^{64}$. Thus, while we do not have a strict bound on the collision probability of the CityHash family, we know just from the small size of its seed that it must have a relatively high collision probability for long strings. In contrast, VHASH and our CLHASH (see § 5) use more than 64 random bits and have correspondingly better collision bounds (see Table 4).
3 VHASH
The VHASH family (12); (25) was designed for 64-bit processors. By default, it operates over 64-bit words. Among hash families offering good almost universality for large data inputs, VHASH might be the fastest 64-bit alternative on x64 processors, except for our own proposal (see § 5).
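VHASH is almost universal and builds on the 128-bit NH family (12):
$$\mathrm{NH}(s) = \sum_{i=1}^{l/2} \big( (s_{2i-1} + k_{2i-1} \bmod 2^{64}) \times (s_{2i} + k_{2i} \bmod 2^{64}) \big) \bmod 2^{128} \qquad (1)$$
where the $s_i$ are the 64-bit words of the input $s$, the $k_i$ are 64-bit random keys and $l$ is the number of words. NH is $1/2^{64}$-almost universal with hash values in $[0, 2^{128})$. Although the NH family is defined only for inputs containing an even number of components, we can extend it to include odd numbers of components by padding the input with a zero component.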
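To make the structure of NH concrete, here is a minimal Python sketch. It assumes the usual NH definition over 64-bit words (pairwise sums modulo $2^{64}$, products accumulated modulo $2^{128}$); the function name `nh` and the list-based interface are illustrative choices, not taken from any VHASH implementation.

```python
MASK64 = 2**64 - 1

def nh(s, k):
    """NH over 64-bit words: pairwise (s+k)*(s'+k') summed modulo 2**128.

    s: list of 64-bit input words; k: list of 64-bit keys (at least as long
    as the padded input). Both names are illustrative.
    """
    s = list(s)
    if len(s) % 2 == 1:
        s.append(0)  # pad odd-length inputs with a zero component
    if len(k) < len(s):
        raise ValueError("not enough key words")
    total = 0
    for i in range(0, len(s), 2):
        a = (s[i] + k[i]) & MASK64          # addition modulo 2**64
        b = (s[i + 1] + k[i + 1]) & MASK64  # addition modulo 2**64
        total = (total + a * b) % 2**128    # 64x64 -> 128-bit product
    return total
```

A block of 16 words leads to 8 iterations of the loop, that is, 8 full multiplications.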
We can summarize VHASH (see Algorithm 1) as follows:

NH is used to generate a 128-bit hash value for each block of 16 words. The result is $1/2^{64}$-almost universal on each block.

These hash values are mapped to a value in $[0, 2^{126})$ by applying a modular reduction. These reduced hash values are then aggregated with a polynomial hash and finally reduced to a 64-bit value.
In total, the VHASH family is $1/2^{61}$-almost universal over $[0, 2^{64} - 257)$ for input strings of up to $2^{62}$ bits ((12), Theorem 1).
For long input strings, we expect that much of the running time of VHASH is in the computation of NH on blocks of 16 words. On recent x64 processors, this computation involves 8 multiplications using the mulq instruction (with two 64-bit inputs and two 64-bit outputs). For each group of two consecutive words ($s_i$ and $s_{i+1}$), we also need two 64-bit additions. To sum all results, we need seven 128-bit additions, each of which can be implemented using two 64-bit additions (addq and adcq). All of these operations have a throughput of at least 1 per cycle on Haswell processors. We can expect NH and, by extension, VHASH to be fast.
VHASH uses only 16 64-bit random integers for the NH family. As in § 2.3, we only need one specific NH function irrespective of the length of the string. VHASH also uses a 128-bit random integer and two more 64-bit random integers. Thus VHASH uses slightly less than 160 random bytes.
3.1 Random Bits
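Nguyen and Roscoe showed that at least $\log_2(m/\epsilon)$ random bits are required (31), where $m$ is the maximal string length in bits and $\epsilon$ is the collision bound.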
That is, 16 random bytes are theoretically required to achieve the same collision bound as VHASH while many more are used (160 bytes). This suggests that we might be able to find families using far fewer random bits while maintaining the same good bounds. In fact, it is not difficult to modify VHASH to reduce the use of random bits. It would suffice to reduce the size of the blocks down from 16 words. We could show that this increases the bound on the collision probability only marginally. However, reducing the size of the blocks has an adverse effect on speed. With large blocks and long strings, most of the input is processed with the NH function before the more expensive polynomial hash function is used. Thus, there is a tradeoff between speed and the number of random bits, and VHASH is designed for speed on long strings.
4 Finite Fields
Our proposed hash family (CLHASH, see § 5) works over a binary finite field. For completeness, we review field theory briefly, introducing (classical) results as needed for our purposes.
The real numbers form what is called a field. A field is such that addition and multiplication are associative, commutative and distributive. We also have identity elements (0 for addition and 1 for multiplication). Crucially, all non-zero elements $a$ have an inverse $a^{-1}$ (which is defined by $a \times a^{-1} = a^{-1} \times a = 1$).
Finite fields (also called Galois fields) are fields containing a finite number of elements. All finite fields have cardinality $p^n$ for some prime $p$. Up to an algebraic isomorphism (i.e., a one-to-one map preserving addition and multiplication), given a cardinality $p^n$, there is only one field (henceforth $GF(p^n)$). And for any power of a prime, there is a corresponding field.
4.1 Finite Fields of Prime Cardinality
It is easy to create finite fields that have prime cardinality ($GF(p)$). Given $p$, an instance of $GF(p)$ is given by the set of integers in $[0, p)$ with additions and multiplications completed by a modular reduction:

$a \times_{GF(p)} b \equiv a \times b \bmod p$

and $a +_{GF(p)} b \equiv a + b \bmod p$.
The numbers 0 and 1 are the identity elements. Given an element $a$, its additive inverse is $(p - a) \bmod p$.
It is not difficult to check that all non-zero elements have a multiplicative inverse. We review this classical result for completeness. Given a non-zero element $a \in GF(p)$ and two distinct $x, x' \in GF(p)$, we have that $ax \bmod p \ne ax' \bmod p$ because $p$ is prime. Hence, starting with a fixed non-zero element $a$, we have that the set $\{ ax \bmod p \mid x \in GF(p) \}$ has cardinality $p$ and must contain 1; thus, $a$ must have a multiplicative inverse.
4.2 Hash Families in a Field
Within a field, we can easily construct hash families having strong theoretical guarantees, as the next lemma illustrates.
Lemma 4
The family of functions of the form
$$h(x) = ax$$
in a finite field ($GF(p^n)$) is universal, provided that the key $a$ is picked from all values of the field.
As another example, consider hash functions of the form $h(x_1, x_2, \ldots, x_m) = a^{m-1} x_1 + a^{m-2} x_2 + \cdots + x_m$ where $a$ is picked at random (a random input). Such polynomial hash functions can be computed efficiently using Horner’s rule: starting with $r = x_1$, compute $r \leftarrow ar + x_i$ for $i = 2, \ldots, m$. Given any two distinct inputs, $x_1, x_2, \ldots, x_m$ and $x'_1, x'_2, \ldots, x'_m$, we have that $h(x_1, \ldots, x_m) - h(x'_1, \ldots, x'_m)$ is a non-zero polynomial of degree at most $m-1$ in $a$. By the fundamental theorem of algebra, we have that it is zero for at most $m-1$ distinct values of $a$. Thus we have that the probability of a collision is bounded by $(m-1)/p^n$ where $p^n$ is the cardinality of the field. For example, VHASH uses polynomial hashing with $p = 2^{127} - 1$ and $n = 1$.
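Horner’s rule for polynomial hashing can be sketched in Python as follows. The field is taken to be $GF(p)$ with the Mersenne prime $p = 2^{61} - 1$; this modulus is an assumption made only to keep the example small, and the function names are illustrative.

```python
P = 2**61 - 1  # a Mersenne prime, chosen here only for illustration

def poly_hash(xs, a, p=P):
    """Evaluate a^(m-1)*x_1 + ... + x_m modulo p using Horner's rule."""
    r = xs[0] % p
    for x in xs[1:]:
        r = (a * r + x) % p  # one multiplication per input component
    return r

def poly_hash_direct(xs, a, p=P):
    """Same polynomial evaluated term by term, for comparison."""
    m = len(xs)
    return sum(pow(a, m - 1 - i, p) * x for i, x in enumerate(xs)) % p
```

With `a = 10` and small inputs the result simply reads off the decimal digits, e.g. `poly_hash([1, 2, 3], 10)` is 123, which makes the evaluation order easy to check by hand.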
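We can further reduce the collision probabilities if we use $m$ random inputs $a_1, a_2, \ldots, a_m$ picked in the field to compute a multilinear function: $h(x_1, x_2, \ldots, x_m) = a_1 x_1 + a_2 x_2 + \cdots + a_m x_m$. We have universality. Given two distinct inputs, $x_1, x_2, \ldots, x_m$ and $x'_1, x'_2, \ldots, x'_m$, we have that $x_i \ne x'_i$ for some $i$. Thus we have that $h(x_1, \ldots, x_m) = h(x'_1, \ldots, x'_m)$ if and only if $a_i = (x_i - x'_i)^{-1} \sum_{j \ne i} a_j (x'_j - x_j)$.
If $m$ is even, we can get the same bound on the collision probability with half the number of multiplications (7); (26); (29):
$$h(x_1, x_2, \ldots, x_m) = (a_1 + x_1)(a_2 + x_2) + \cdots + (a_{m-1} + x_{m-1})(a_m + x_m).$$
The argument is similar. Consider that
$$(a_1 + x_1)(a_2 + x_2) = a_1 a_2 + a_1 x_2 + a_2 x_1 + x_1 x_2.$$
Take two distinct inputs, $x_1, x_2, \ldots, x_m$ and $x'_1, x'_2, \ldots, x'_m$. As before, we have that $x_i \ne x'_i$ for some $i$. Without loss of generality, assume that $i$ is odd; then we can find a unique solution for $a_{i+1}$: to do this, start from $h(x_1, \ldots, x_m) = h(x'_1, \ldots, x'_m)$ and solve for $a_{i+1}(x_i - x'_i)$ in terms of an expression that does not depend on $a_{i+1}$. Then use the fact that $x_i - x'_i$ has an inverse. This shows that the collision probability is bounded by $1/p^n$ and we have universality.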
Lemma 5
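Given an even number $m$, the family of functions of the form
$$h(x_1, x_2, \ldots, x_m) = (a_1 + x_1)(a_2 + x_2) + \cdots + (a_{m-1} + x_{m-1})(a_m + x_m)$$
in a finite field ($GF(p^n)$) is universal, provided that the keys $a_1, \ldots, a_m$ are picked from all values of the field. In particular, the collision probability between two distinct inputs is bounded by $1/p^n$.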
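The two multilinear constructions can be compared with a small Python sketch. It works in a prime field $GF(p)$, where the prime is a parameter chosen by the caller (an assumption for illustration); the helper `colliding_keys` exhaustively counts the keys under which two given 2-component inputs collide, which is only feasible for tiny fields.

```python
def multilinear(xs, keys, p):
    """a_1*x_1 + ... + a_m*x_m in GF(p): m multiplications."""
    return sum(a * x for a, x in zip(keys, xs)) % p

def multilinear_halved(xs, keys, p):
    """(a_1+x_1)(a_2+x_2) + ... in GF(p): m/2 multiplications, m even."""
    if len(xs) % 2 != 0:
        raise ValueError("the number of components must be even")
    total = 0
    for i in range(0, len(xs), 2):
        total += (keys[i] + xs[i]) * (keys[i + 1] + xs[i + 1])
    return total % p

def colliding_keys(x, y, p):
    """Count key pairs (a_1, a_2) in GF(p)^2 under which x and y collide."""
    return sum(
        multilinear_halved(x, (a1, a2), p) == multilinear_halved(y, (a1, a2), p)
        for a1 in range(p)
        for a2 in range(p)
    )
```

For $p = 7$ and the distinct inputs $(0, 0)$ and $(1, 0)$, exactly 7 of the 49 possible key pairs collide, matching the $1/p$ bound of the lemma.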
4.3 Binary Finite Fields
Finite fields having prime cardinality are simple (see § 4.1), but we would prefer to work with fields having a power-of-two cardinality (also called binary fields) to match common computer architectures. Specifically, we are interested in $GF(2^{64})$ because our desktop processors typically have 64-bit architectures.
We can implement such a field over the integers in $[0, 2^L)$ by using the following two operations. Addition is defined as the bitwise XOR ($\oplus$) operation, which is fast on most computers:
$$a +_{GF(2^L)} b \equiv a \oplus b.$$
The number 0 is the additive identity element ($a \oplus 0 = 0 \oplus a = a$), and every number is its own additive inverse: $a \oplus a = 0$. Note that because binary finite fields use XOR as an addition, universality and XOR-universality are effectively equivalent for our purposes in binary finite fields.
Multiplication is defined as a carryless multiplication followed by a reduction. We use the convention that $a_i$ is the $i$th least significant bit of integer $a$ and $a_i = 0$ if $i$ is larger than the most significant bit of $a$. The $i$th bit of the carryless multiplication $a \star b$ of $a$ and $b$ is given by
$$(a \star b)_i \equiv \bigoplus_{k=0}^{i} a_{i-k} b_k \qquad (2)$$
where $a_{i-k} b_k$ is just a regular multiplication between two integers in $\{0, 1\}$ and $\bigoplus_{k=0}^{i}$ is the bitwise XOR of a range of values. The carryless product of two $L$-bit integers is a $2L$-bit integer. We can check that the integers with $\oplus$ as addition and $\star$ as multiplication form a ring: addition and multiplication are associative, commutative and distributive, and there is an additive identity element. In this instance, the number 1 is a multiplicative identity element ($a \star 1 = 1 \star a = a$). Except for the number 1, no number has a multiplicative inverse in this ring.
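Equation 2 amounts to a shift-and-XOR schoolbook multiplication in which the additions never carry. A minimal Python sketch follows; the name `clmul` is an illustrative choice, and the routine is a software model of the operation rather than a description of how pclmulqdq is implemented.

```python
def clmul(a, b):
    """Carryless product of two non-negative integers.

    For every set bit k of b, XOR in a shifted left by k positions;
    this yields bit i of the result as the XOR of a_(i-k) * b_k over k.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result
```

For instance, `clmul(3, 3)` is 5 (binary 11 times 11 gives 101), whereas the ordinary product is 9: the carry out of the middle bit is discarded.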
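Given the ring determined by $\oplus$ and $\star$, we can derive a corresponding finite field. However, just as with finite fields of prime cardinality, we need some kind of modular reduction and a concept equivalent to that of prime numbers.
Let us define $\mathrm{degree}(x)$ to be the position of the most significant non-zero bit of $x$, starting at 0 (e.g., $\mathrm{degree}(1) = 0$, $\mathrm{degree}(2) = 1$, $\mathrm{degree}(2^j) = j$). For example, we have $\mathrm{degree}(x) \le 127$ for any 128-bit integer $x$. Given any two non-zero integers $a, b$, we have that $\mathrm{degree}(a \star b) = \mathrm{degree}(a) + \mathrm{degree}(b)$ as a straightforward consequence of Equation 2. Similarly, we have that
$$\mathrm{degree}(a \oplus b) \le \max(\mathrm{degree}(a), \mathrm{degree}(b)).$$
Not unlike regular multiplication, given integers $a, b$ with $b \ne 0$, there are unique integers $\alpha, \beta$ (henceforth the quotient and the remainder) such that
$$a = \alpha \star b \oplus \beta \qquad (3)$$
where $\mathrm{degree}(\beta) < \mathrm{degree}(b)$.
The uniqueness of the quotient and the remainder is easily shown. Suppose that there is another pair of values $\alpha', \beta'$ with the same property. Then $\alpha' \star b \oplus \beta' = \alpha \star b \oplus \beta$ which implies that $(\alpha' \oplus \alpha) \star b = \beta' \oplus \beta$. However, since $\mathrm{degree}(\beta' \oplus \beta) < \mathrm{degree}(b)$ we must have that $\alpha = \alpha'$. From this it follows that $\beta = \beta'$, thus establishing uniqueness.
We define $\div$ and $\bmod$ operators as giving respectively the quotient ($a \div b = \alpha$) and remainder ($a \bmod b = \beta$) so that the equation
$$a \equiv a \div b \star b \oplus a \bmod b \qquad (4)$$
is an identity equivalent to Equation 3. (To avoid unnecessary parentheses, we use the following operator precedence convention: $\star$, $\div$ and $\bmod$ are executed first, from left to right, followed by $\oplus$.)
In the general case, we can compute $a \div b$ and $a \bmod b$ using a straightforward variation on the Euclidean division algorithm (see Algorithm 2) which proves the existence of the remainder and quotient. Checking the correctness of the algorithm is straightforward. We start initially with values $\alpha$ and $\beta$ such that $a = \alpha \star b \oplus \beta$. By inspection, this equality is preserved throughout the algorithm. Meanwhile, the algorithm only terminates when the degree of $\beta$ is less than that of $b$, as required. And the algorithm must terminate, since the degree of $\beta$ is reduced by at least one each time it is updated (for a maximum of $\mathrm{degree}(a) - \mathrm{degree}(b) + 1$ steps).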
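A division of this kind can be sketched in Python as below. This is one possible Euclidean-style routine written for illustration; it is not a transcription of Algorithm 2, and the names `degree`, `clmul` and `cl_divmod` are assumptions of this sketch.

```python
def degree(x):
    """Position of the most significant set bit (-1 for x == 0)."""
    return x.bit_length() - 1

def clmul(a, b):
    """Carryless product (shift-and-XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def cl_divmod(a, b):
    """Return (quotient, remainder) with a == clmul(quotient, b) ^ remainder."""
    if b == 0:
        raise ZeroDivisionError("carryless division by zero")
    quotient, remainder = 0, a  # invariant: a == clmul(quotient, b) ^ remainder
    db = degree(b)
    while remainder != 0 and degree(remainder) >= db:
        shift = degree(remainder) - db
        quotient ^= 1 << shift    # record the cancelled term
        remainder ^= b << shift   # clear the leading bit of the remainder
    return quotient, remainder
```

Each pass through the loop clears the leading bit of the remainder, so the loop runs at most `degree(a) - degree(b) + 1` times.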
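Given $a = \alpha \star b \oplus \beta$ and $a' = \alpha' \star b \oplus \beta'$, we have that $a \oplus a' = (\alpha \oplus \alpha') \star b \oplus (\beta \oplus \beta')$. Thus, it can be checked that divisions and modular reductions are distributive:
$$(a \oplus b) \bmod p = (a \bmod p) \oplus (b \bmod p), \qquad (5)$$
$$(a \oplus b) \div p = (a \div p) \oplus (b \div p). \qquad (6)$$
Thus, we have $(a \oplus b) \bmod p = 0 \Rightarrow a \bmod p = b \bmod p$. Moreover, by inspection, we have that $\mathrm{degree}(a \bmod b) < \mathrm{degree}(b)$ and, when $\mathrm{degree}(a) \ge \mathrm{degree}(b)$, $\mathrm{degree}(a \div b) = \mathrm{degree}(a) - \mathrm{degree}(b)$.
The carryless multiplication by a power of two is equivalent to regular multiplication. For this reason, a modular reduction by a power of two (e.g., $a \bmod 2^{64}$) is just the regular integer modular reduction. Idem for division.
There are non-zero integers $a$ such that there is no integer $b$ other than 1 and $a$ itself such that $a \bmod b = 0$; effectively $a$ is a prime number under the carryless multiplication interpretation. These “prime integers” are more commonly known as irreducible polynomials in the ring of polynomials $GF(2)[x]$, so we call them irreducible instead of prime. Let us pick such an irreducible integer $p$ (arbitrarily) such that the degree of $p$ is 64. One such integer is $2^{64} + 2^4 + 2^3 + 2 + 1$. Then we can finally define the multiplication operation in $GF(2^{64})$:
$$a \times_{GF(2^{64})} b \equiv (a \star b) \bmod p.$$
Coupled with the addition $+_{GF(2^{64})}$ that is just a bitwise XOR, we have an implementation of the field $GF(2^{64})$ over integers in $[0, 2^{64})$.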
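The resulting field arithmetic can be modelled in Python as follows. The sketch assumes the irreducible $p = 2^{64} + 2^4 + 2^3 + 2 + 1$ (consistent with the value 27 that appears in Table 3); the function names are illustrative, and the generic bit-by-bit reduction used here is slow and only meant as a reference model.

```python
P = (1 << 64) | 27  # 2^64 + 2^4 + 2^3 + 2 + 1, assumed irreducible polynomial

def clmul(a, b):
    """Carryless product (shift-and-XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def cl_mod(a, p):
    """Carryless remainder of a modulo p (generic, bit by bit)."""
    dp = p.bit_length() - 1
    while a.bit_length() - 1 >= dp:
        a ^= p << (a.bit_length() - 1 - dp)
    return a

def gf_add(a, b):
    """Addition in GF(2^64) is a bitwise XOR."""
    return a ^ b

def gf_mul(a, b):
    """Multiplication in GF(2^64): carryless product, then reduction by P."""
    return cl_mod(clmul(a, b), P)

def gf_inv(a):
    """Inverse of a non-zero a, computed as a^(2^64 - 2) by square-and-multiply."""
    result, base, e = 1, a, 2**64 - 2
    while e:
        if e & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        e >>= 1
    return result
```

The inverse relies on the fact that the non-zero elements of a field with $2^{64}$ elements form a group of order $2^{64} - 1$.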
We call the index of the second most significant bit the subdegree. We chose an irreducible $p$ of degree 64 having minimal subdegree (4).
4.4 Efficient Reduction in $GF(2^{64})$
AMD and Intel have introduced a fast instruction that can compute a carryless multiplication between two 64-bit numbers, and it generates a 128-bit integer. To get the multiplication in $GF(2^{64})$, we must still reduce this 128-bit integer to a 64-bit integer. Since there is no equivalent fast modular instruction, we need to derive an efficient algorithm.
There are efficient reduction algorithms used in cryptography (e.g., from 256-bit to 128-bit integers (17)), but they do not suit our purposes: we have to reduce to 64-bit integers. Inspired by the classical Barrett reduction (5), Knežević et al. proposed a generic modular reduction algorithm in $GF(2^n)$, using no more than two multiplications (22). We put this to good use in previous work (26). However, we can do the same reduction using a single multiplication. According to our tests, the reduction technique presented next is 30% faster than an optimized implementation based on Knežević et al.’s algorithm.
Let us write $p = 2^{64} \oplus r$. In our case, we have $r = 2^4 + 2^3 + 2 + 1 = 27$ and $\mathrm{degree}(r) = 4$. We are interested in applying a modular reduction by $p$ to the result of the multiplication of two integers in $[0, 2^{64})$, and the result of such a multiplication is an integer $x$ such that $\mathrm{degree}(x) \le 127$. We want to compute $x \bmod p$ quickly. We begin with the following lemma.
Lemma 6
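Consider any integer $p = 2^{64} \oplus r$ of degree 64. We define the operations $\div$ and $\bmod$ as the counterparts of the carryless multiplication $\star$ as in § 4.3. Given any $x$, we have that
$$x \bmod p = ((z \div 2^{64}) \star 2^{64}) \bmod p \oplus z \bmod 2^{64} \oplus x \bmod 2^{64}$$
where $z \equiv (x \div 2^{64}) \star r$.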
Proof
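We have that $x = (x \div 2^{64}) \star 2^{64} \oplus x \bmod 2^{64}$ for any $x$ by definition. Applying the modular reduction on both sides of the equality, we get
$$\begin{aligned}
x \bmod p &= ((x \div 2^{64}) \star 2^{64}) \bmod p \oplus (x \bmod 2^{64}) \bmod p \\
&= ((x \div 2^{64}) \star 2^{64}) \bmod p \oplus x \bmod 2^{64} && \text{by Fact 1} \\
&= ((x \div 2^{64}) \star r) \bmod p \oplus x \bmod 2^{64} && \text{by Fact 2} \\
&= z \bmod p \oplus x \bmod 2^{64} && \text{by $z$’s def.} \\
&= ((z \div 2^{64}) \star 2^{64}) \bmod p \oplus z \bmod 2^{64} \oplus x \bmod 2^{64} && \text{by Fact 3}
\end{aligned}$$
where Facts 1, 2 and 3 are as follows:

(Fact 1) For any $x$, we have that $(x \bmod 2^{64}) \bmod p = x \bmod 2^{64}$.

(Fact 2) For any integer $z$, we have that $(2^{64} \oplus r) \star z \bmod p = 0$ and therefore
$$2^{64} \star z \bmod p = r \star z \bmod p$$
by the distributivity of the modular reduction (Equation 5).

(Fact 3) Recall that by definition $z = (z \div 2^{64}) \star 2^{64} \oplus z \bmod 2^{64}$. We can substitute this equation in the equation from Fact 1. For any $z$ and any non-zero $p$, we have that
$$z \bmod p = ((z \div 2^{64}) \star 2^{64}) \bmod p \oplus z \bmod 2^{64}$$
by the distributivity of the modular reduction (see Equation 5).
Hence the result is shown.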
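The identity of Lemma 6 can be checked numerically with a short Python sketch. It assumes $r = 27$ and $p = 2^{64} \oplus r$, uses plain shifts and masks for division and reduction by $2^{64}$, and compares the formula against a generic bit-by-bit carryless reduction; all function names are illustrative.

```python
R = 27
P = (1 << 64) | R  # p = 2^64 xor r, with r = 27 assumed
MASK64 = 2**64 - 1

def clmul(a, b):
    """Carryless product (shift-and-XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def cl_mod(a, p):
    """Generic carryless remainder, used here as the reference."""
    dp = p.bit_length() - 1
    while a.bit_length() - 1 >= dp:
        a ^= p << (a.bit_length() - 1 - dp)
    return a

def reduce_lemma6(x):
    """x mod p via the formula of Lemma 6, for any x below 2^128."""
    z = clmul(x >> 64, R)                  # z = (x div 2^64) * r
    high = cl_mod((z >> 64) << 64, P)      # ((z div 2^64) * 2^64) mod p
    return high ^ (z & MASK64) ^ (x & MASK64)
```

The only remaining generic reduction is applied to `(z >> 64) << 64`, a value determined by the few high bits of `z`.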
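Lemma 6 provides a formula to compute $x \bmod p$. Computing $z$ involves a carryless multiplication, which can be done efficiently on recent Intel and AMD processors. The computation of $z \bmod 2^{64}$ and $x \bmod 2^{64}$ is trivial. It remains to compute $((z \div 2^{64}) \star 2^{64}) \bmod p$. At first glance, we still have a modular reduction. However, we can easily memoize the result of $((z \div 2^{64}) \star 2^{64}) \bmod p$. The next lemma shows that there are only 16 distinct values to memoize (this follows from the low subdegree of $p$).
Lemma 7
Given that $x$ has degree less than 128, there are only 16 possible values of $(z \div 2^{64}) \star 2^{64} \bmod p$, where $z \equiv (x \div 2^{64}) \star r$ and $r = 2^4 + 2^3 + 2 + 1$.
Proof
Indeed, we have that
$$\mathrm{degree}(z) = \mathrm{degree}(x) - 64 + \mathrm{degree}(r).$$
Because $\mathrm{degree}(x) \le 127$, we have that $\mathrm{degree}(z) \le 127 - 64 + 4 = 67$. Therefore, we have $\mathrm{degree}(z \div 2^{64}) \le 3$. Hence, we can represent $z \div 2^{64}$ using 4 bits: there are only 16 4-bit integers.
Thus, in the worst possible case, we would need to memoize 16 distinct 128-bit integers to represent $((z \div 2^{64}) \star 2^{64}) \bmod p$. However, observe that the degree of $z \div 2^{64}$ is bounded by $\mathrm{degree}(x) - 64 + 4 - 64 \le 127 - 124 = 3$ since $\mathrm{degree}(x) \le 127$. By using Lemma 8, we show that each integer $((z \div 2^{64}) \star 2^{64}) \bmod p$ has degree bounded by 7 so that it can be represented using no more than 8 bits: setting $L = 64$ and $w = z \div 2^{64}$, $\mathrm{degree}(w) \le 3$, $\mathrm{degree}(r) = 4$ and $\mathrm{degree}(w) + \mathrm{degree}(r) \le 7$.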
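Effectively, the lemma says that if you take a value of small degree $w$, you multiply it by $2^L$ and then compute the modular reduction on the result and a value $p$ that is almost $2^L$ (except for a value of small degree $r$), then the result has small degree: it is bounded by the sum of the degrees of $w$ and $r$.
Lemma 8
Consider $p = 2^L \oplus r$, with $r$ of degree less than $L$. For any $w$, the degree of $w \star 2^L \bmod p$ is bounded by $\mathrm{degree}(w) + \mathrm{degree}(r)$.
Moreover, when $\mathrm{degree}(w) + \mathrm{degree}(r) < L$ then the degree of $w \star 2^L \bmod p$ is exactly $\mathrm{degree}(w) + \mathrm{degree}(r)$.
Proof
The result is trivial if $\mathrm{degree}(w) + \mathrm{degree}(r) \ge L$, since the degree of $w \star 2^L \bmod p$ must be smaller than the degree of $p$.
So let us assume that $\mathrm{degree}(w) + \mathrm{degree}(r) < L$. By the definition of the modular reduction (Equation 4), we have
$$w \star 2^L = (w \star 2^L \div p) \star p \oplus w \star 2^L \bmod p = (w \star 2^L \div p) \star r \oplus (w \star 2^L \div p) \star 2^L \oplus w \star 2^L \bmod p.$$
Let $w' = w \star 2^L \div p$, then
$$w \star 2^L = w' \star r \oplus w' \star 2^L \oplus w \star 2^L \bmod p.$$
The first $L$ bits of $w \star 2^L$ and $w' \star 2^L$ are zero. Therefore, we have
$$(w' \star r) \bmod 2^L = (w \star 2^L \bmod p) \bmod 2^L.$$
Moreover, the degree of $w'$ is the same as the degree of $w$: $\mathrm{degree}(w') = \mathrm{degree}(w \star 2^L) - \mathrm{degree}(p) = \mathrm{degree}(w) + L - L = \mathrm{degree}(w)$. Hence, we have $\mathrm{degree}(w' \star r) = \mathrm{degree}(w) + \mathrm{degree}(r) < L$. And, of course, $\mathrm{degree}(w \star 2^L \bmod p) < L$. Thus, we have that
$$w' \star r = w \star 2^L \bmod p.$$
Hence, it follows that $\mathrm{degree}(w \star 2^L \bmod p) = \mathrm{degree}(w' \star r) = \mathrm{degree}(w) + \mathrm{degree}(r)$.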
Thus the memoization requires access to only 16 8-bit values. We enumerate the values in question ($w \star 2^{64} \bmod p$ for $w = 0, 1, \ldots, 15$) in Table 3. It is convenient that $16 \times 8 = 128$ bits: the entire table fits in a 128-bit word. It means that if the list of 8-bit values is stored using one byte each, the SSSE3 instruction pshufb can be used for fast lookup. (See Algorithm 3.)
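$w$ (decimal)  $w$ (binary)  $w \star 2^{64} \bmod p$ (decimal)  $w \star 2^{64} \bmod p$ (binary)

0  0000  0  00000000
1  0001  27  00011011
2  0010  54  00110110
3  0011  45  00101101
4  0100  108  01101100
5  0101  119  01110111
6  0110  90  01011010
7  0111  65  01000001
8  1000  216  11011000
9  1001  195  11000011
10  1010  238  11101110
11  1011  245  11110101
12  1100  180  10110100
13  1101  175  10101111
14  1110  130  10000010
15  1111  153  10011001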
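The table can be regenerated, and used for a single-multiplication reduction, with the following Python sketch. It assumes $p = 2^{64} \oplus 27$; an ordinary Python list stands in for the 128-bit register that pshufb would index, and the names `TABLE` and `reduce128` are illustrative.

```python
R = 27
P = (1 << 64) | R  # p = 2^64 xor 27 (assumed)
MASK64 = 2**64 - 1

def clmul(a, b):
    """Carryless product (shift-and-XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def cl_mod(a, p):
    """Generic carryless remainder, used to build and check the table."""
    dp = p.bit_length() - 1
    while a.bit_length() - 1 >= dp:
        a ^= p << (a.bit_length() - 1 - dp)
    return a

# The 16 memoized values (w * 2^64) mod p for w = 0..15; each fits in a byte.
TABLE = [cl_mod(w << 64, P) for w in range(16)]

def reduce128(x):
    """Reduce x (below 2^128) modulo p with one carryless multiplication."""
    z = clmul(x >> 64, R)       # degree(z) <= 67, so z >> 64 is at most 15
    return TABLE[z >> 64] ^ (z & MASK64) ^ (x & MASK64)
```

In a SIMD implementation the list lookup `TABLE[z >> 64]` would be replaced by a byte shuffle on a register holding the 16 bytes.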
5 CLHASH
The CLHASH family resembles the VHASH family, except that members work in a binary finite field. The VHASH family has the 128-bit NH family (see Equation 1), but we instead use the 128-bit CLNH family:
$$\mathrm{CLNH}(s) = \bigoplus_{i=1}^{l/2} \big( (s_{2i-1} \oplus k_{2i-1}) \star (s_{2i} \oplus k_{2i}) \big) \qquad (7)$$
where the $s_i$ and $k_i$’s are 64-bit integers and $l$ is the length of the string $s$. The formula assumes that $l$ is even: we pad odd-length inputs with a single zero word. When an input string $M$ is made of $|M|$ bytes, we can consider it as a string of 64-bit words $s$ by padding it with up to 7 zero bytes so that $|M|$ is divisible by 8.
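A software model of CLNH can be written in a few lines of Python. The sketch below assumes the pairwise XOR-then-carryless-multiply structure of Equation 7; the little-endian byte order used when converting bytes to words is an assumption of this example, as are the function names.

```python
def clmul(a, b):
    """Carryless product (shift-and-XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def bytes_to_words(data):
    """Pad with up to 7 zero bytes and split into 64-bit words.

    Little-endian order is an assumption made for this sketch.
    """
    padded = data + b"\x00" * (-len(data) % 8)
    return [int.from_bytes(padded[i:i + 8], "little")
            for i in range(0, len(padded), 8)]

def clnh(s, k):
    """CLNH over 64-bit words: XOR of carryless products of keyed pairs."""
    s = list(s)
    if len(s) % 2 == 1:
        s.append(0)  # pad odd-length inputs with a single zero word
    if len(k) < len(s):
        raise ValueError("not enough key words")
    acc = 0
    for i in range(0, len(s), 2):
        acc ^= clmul(s[i] ^ k[i], s[i + 1] ^ k[i + 1])
    return acc
```

Compared with NH, the 64-bit additions become XORs and the 128-bit additions with carry become a single 128-bit XOR per term.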
On x64 processors with the CLMUL instruction set, a single term $(s_{2i-1} \oplus k_{2i-1}) \star (s_{2i} \oplus k_{2i})$ can be computed using one 128-bit XOR instruction (pxor in SSE2) and one carryless multiplication using the pclmulqdq instruction:

load in a 128bit word,

load