Decoding billions of integers per second through vectorization
Abstract
In many important applications, such as search engines and relational database systems, data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and SIMD instructions. Nevertheless, we introduce a novel vectorized scheme called SIMDBP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varintG8IU and PFOR). At the same time, SIMDBP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMDFastPFOR) that has a compression ratio within 10% of a state-of-the-art scheme (Simple8b) while being two times faster during decoding.
D. Lemire (1, *), L. Boytsov (2)
(1) LICEF Research Center, TELUQ, Montreal, QC, Canada; (2) Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
(*) Correspondence to: LICEF Research Center, TELUQ, Université du Québec, 5800 Saint-Denis, Montreal (Quebec) H2S 3L5 Canada.
KEY WORDS: performance; measurement; index compression; vector processing
Contract/grant sponsor: Natural Sciences and Engineering Research Council of Canada; contract/grant number: 261437
1 Introduction
Computer memory is a hierarchy of storage devices that range from slow and inexpensive (disk or tape) to fast but expensive (registers or CPU cache). In many situations, application performance is inhibited by access to slower storage devices, at lower levels of the hierarchy. Previously, only disks and tapes were considered to be slow devices. Consequently, application developers tended to optimize only disk and/or tape I/O. Nowadays, CPUs have become so fast that access to main memory is a limiting factor for many workloads [1, 2, 3, 4, 5]: data compression can significantly improve query performance by reducing the main-memory bandwidth requirements.
Data compression helps to load and keep more of the data in faster storage. Hence, high-speed compression schemes can improve the performance of database systems [6, 7, 8] and text retrieval engines [9, 10, 11, 12, 13].
We focus on compression techniques for 32-bit integer sequences. It is best if most of the integers are small, because we can save space by representing small integers more compactly, i.e., using short codes. Assume, for example, that none of the values is larger than 255. Then we can encode each integer using one byte, thus achieving a compression ratio of 4: an integer uses 4 bytes in the uncompressed format.
In relational database systems, column values are transformed into integer values by dictionary coding [14, 15, 16, 17, 18]. To improve compressibility, we may map the most frequent values to the smallest integers [19]. In text retrieval systems, word occurrences are commonly represented by sorted lists of integer document identifiers, also known as posting lists. These identifiers are converted to small integer numbers through data differencing. Other database indexes can also be stored similarly [20].
A mainstream approach to data differencing is differential coding (see Fig. 3). Instead of storing the original array of sorted integers ($x_1, x_2, \ldots$ with $x_i \leq x_{i+1}$ for all $i$), we keep only the difference between successive elements together with the initial value: ($x_1, \delta_2 = x_2 - x_1, \delta_3 = x_3 - x_2, \ldots$). The differences (or deltas) are nonnegative integers that are typically much smaller than the original integers. Therefore, they can be compressed more efficiently. We can then reconstruct the original arrays by computing prefix sums ($x_j = x_1 + \sum_{i=2}^{j} \delta_i$). Differential coding is also known as delta coding [18, 21, 22], not to be confused with Elias delta coding (§ 2.3). A possible downside of differential coding is that random access to an integer located at a given index may require summing up several deltas: if needed, we can alleviate this problem by partitioning large arrays into smaller ones.
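As a minimal illustration of differential coding and its inverse, the following Python sketch computes the deltas and then recovers the original values with a prefix sum. The function names are ours, chosen for illustration only.

```python
from itertools import accumulate

def delta_encode(values):
    """Keep the first value, then the gaps between successive sorted values."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values as the prefix sums of the deltas."""
    return list(accumulate(deltas))

doc_ids = [3, 7, 8, 20, 21]
deltas = delta_encode(doc_ids)      # small nonnegative gaps
restored = delta_decode(deltas)     # same content as doc_ids
```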
An engineer might be tempted to compress the result using generic compression tools such as LZO, Google Snappy, FastLZ, LZ4 or gzip. Yet this might be ill-advised. Our fastest schemes are an order of magnitude faster than a fast generic library like Snappy, while compressing better (see § 6.5).
Instead, it might be preferable to compress these arrays of integers using specialized schemes based on Single-Instruction, Multiple-Data (SIMD) operations. Stepanov et al. [12] reported that their SIMD-based varintG8IU algorithm outperformed the classic variable byte coding method (see § 2.4) by 300%. They also showed that use of SIMD instructions allows one to improve performance of decoding algorithms by more than 50%.
In Table I, we report the speeds of the fastest decoding algorithms reported in the literature on desktop processors. These numbers cannot be directly compared since hardware, compilers, benchmarking methodology, and data sets differ. However, one can gather that varintG8IU—which can be viewed as an improvement on the Group Varint Encoding [13] (varintGB) used by Google—is, probably, the fastest method (except for our new schemes) in the literature. According to our own experimental evaluation (see Tables IV, V and Fig. 27), varintG8IU is indeed one of the most efficient methods, but there are previously published schemes that offer similar or even slightly better performance such as PFOR [23]. We, in turn, were able to further surpass the decoding speed of varintG8IU by a factor of two while improving the compression ratio.
We report our own speed in a conservative manner: (1) our timings are based on the wall-clock time and not the commonly used CPU time, (2) our timings incorporate all of the decoding operations including the computation of the prefix sum whereas this is sometimes omitted by other authors [24], (3) we report a speed of 2300 million integers per second (mis) achievable for realistic data sets, while higher speed is possible (e.g., we report a speed of 2500 mis on some realistic data and 2800 mis on some synthetic data).
Another observation we can make from Table I is that not all authors have chosen to make explicit use of SIMD instructions. While there have been several variations on PFOR [23] such as NewPFD and OptPFD [10], we introduce for the first time a variation designed to exploit the vectorization instructions available since the introduction of the Pentium 4 and the Streaming SIMD Extensions 2 (henceforth SSE2). Our experimental results indicate that such vectorization is desirable: our SIMDFastPFOR scheme surpasses the decoding speed of PFOR by at least 30% while offering a superior compression ratio (10%). In some instances, SIMDFastPFOR is twice as fast as the original PFOR.
For most schemes, the prefix sum computation is so fast as to represent 20% or less of the running time. However, because our novel schemes are much faster, the prefix sum can account for the majority of the running time.
Hence, we had to experiment with faster alternatives. We find that a vectorized prefix sum using SIMD instructions can be twice as fast. Without vectorized differential coding, we were unable to reach a speed of two billion integers per second.
In a sense, the speed gains we have achieved are a direct application of advanced hardware instructions to the problem of integer coding (specifically SSE2 introduced in 2001). Nevertheless, it is instructive to show how this is done, and to quantify the benefits that accrue.
Reference  Speed (mis)  Cycles/int  Fastest scheme  Processor  SIMD
this paper  2300  1.5  SIMDBP128  Core i7 (3.4 GHz)  SSE2
Stepanov et al. (2011) [12]  1512  2.2  varintG8IU  Xeon (3.3 GHz)  SSSE3
Anh and Moffat (2010) [25]  1030  2.3  binary packing  Xeon (2.33 GHz)  no
Silvestri and Venturini (2010) [24]  835  n/a  VSEncoding  Xeon  no
Yan et al. (2009) [10]  1120  2.4  NewPFD  Core 2 (2.66 GHz)  no
Zhang et al. (2008) [26]  890  3.6  PFOR2008  Pentium 4 (3.2 GHz)  no
Zukowski et al. (2006) [23, § 5]  1024  2.9  PFOR  Pentium 4 (3 GHz)  no
2 Related work
Some of the earliest integer compression techniques are Golomb coding [27], Rice coding [28], as well as Elias gamma and delta coding [29]. In recent years, several faster techniques have been added such as the Simple family, binary packing, and patched coding. We briefly review them.
Because we work with unsigned integers, we make use of two representations: binary and unary. In both systems numbers are represented using only two digits: 0 and 1. The binary notation is a standard positional base-2 system (e.g., $1 \to 1$, $2 \to 10$, $3 \to 11$). Given a positive integer $x$, the binary notation requires $\lceil \log_2 (x+1) \rceil$ bits. Computers commonly store unsigned integers in the binary notation using a fixed number of bits by adding leading zeros: e.g., 3 is written as 00000011 using 8 bits. In unary notation, we represent a number $x$ as a sequence of $x-1$ digits 0 followed by the digit 1 (e.g., $1 \to 1$, $2 \to 01$, $3 \to 001$) [30]. If the number $x$ can be zero, we can store $x+1$ instead.
2.1 Golomb and Rice coding
In Golomb coding [27], given a fixed parameter $b$ and a positive integer $v$ to be compressed, the quotient $\lfloor v/b \rfloor$ is coded in unary. The remainder $r = v \bmod b$ is stored using the usual binary notation with no more than $\lceil \log_2 b \rceil$ bits. If $v$ can be zero, we can code $v+1$ instead. When $b$ is chosen to be a power of two, the resulting algorithm is called Rice coding [28]. The parameter $b$ can be chosen optimally by assuming that the integers follow a known distribution [27].
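To make the construction concrete, here is a small Python sketch of Rice coding that returns the code as a string of 0s and 1s. It assumes $b = 2^k$ for a hypothetical parameter `k`, and it writes the quotient $q$ as $q$ zeros followed by a one (that is, $q+1$ in the unary notation above, since the quotient may be zero). It illustrates the idea and is not a tuned implementation.

```python
def rice_encode(v, k):
    """Rice code of v >= 0 with b = 2**k: unary quotient, then k-bit remainder."""
    q, r = v >> k, v & ((1 << k) - 1)
    unary = "0" * q + "1"                       # q zeros, then the terminating 1
    binary = format(r, "b").zfill(k) if k else ""
    return unary + binary

def rice_decode(bits, k):
    """Inverse of rice_encode for a single code word."""
    q = bits.index("1")                         # number of leading zeros
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    return (q << k) | r

code = rice_encode(9, 2)   # quotient 2, remainder 1
```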
2.2 Interpolative coding
If speed is not an issue but high compression over sorted arrays is desired, interpolative coding [32] might be appealing. In this scheme, we first store the lowest and the highest value, $x_1$ and $x_n$, e.g., in an uncompressed form. Then a value in between is stored in a binary form, using the fact that this value must be in the range $(x_1, x_n)$. For example, if $x_1 = 16$ and $x_n = 31$, we know that for any value $x$ in between, the difference $x - x_1$ is from 0 to 15. Hence, we can encode this difference using only 4 bits. The technique is then repeated recursively. Unfortunately, it is slower than Golomb coding [9, 10].
2.3 Elias gamma and delta coding
An Elias gamma code [29, 30, 33] consists of two parts. The first part encodes in unary notation the minimum number of bits necessary to store the positive integer in the binary notation ($\lceil \log_2 (x+1) \rceil$). The second part represents the integer in binary notation less the most significant digit. If the integer is equal to one, the second part is empty (e.g., $1 \to 1$, $2 \to 01\,0$, $3 \to 01\,1$, $4 \to 001\,00$, $5 \to 001\,01$). If integers can be zero, we can code their values incremented by one.
As numbers become large, gamma codes become inefficient. For better compression, Elias delta codes encode the first part (the number $\lceil \log_2 (x+1) \rceil$) using the Elias gamma code, while the second part is coded in the binary notation. For example, to code the number 8 using the Elias delta code, we must first store $4 = \lceil \log_2 (8+1) \rceil$ as a gamma code ($001\,00$) and then we can store all but the most significant bit of the number 8 in the binary notation ($000$). The net result is $001\,00\,000$.
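The two codes can be sketched in a few lines of Python, producing bit strings for readability. The helper names are ours; the unary convention is the one defined at the start of this section ($n-1$ zeros followed by a one).

```python
def unary(n):
    """Unary code of a positive integer: n - 1 zeros followed by a one."""
    return "0" * (n - 1) + "1"

def elias_gamma(x):
    """Bit length in unary, then the binary digits without the leading one."""
    n = x.bit_length()                 # equals ceil(log2(x + 1)) for x >= 1
    return unary(n) + format(x, "b")[1:]

def elias_delta(x):
    """Bit length as a gamma code, then the binary digits without the leading one."""
    n = x.bit_length()
    return elias_gamma(n) + format(x, "b")[1:]

gamma_codes = [elias_gamma(x) for x in range(1, 6)]
delta_of_8 = elias_delta(8)
```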
However, Variable Byte is twice as fast as Elias gamma and delta coding [24]. Hence, like Golomb coding, gamma coding falls short of our objective of compressing billions of integers per second.
2.3.1 k-gamma
Schlegel et al. [34] proposed a version of Elias gamma coding better suited to current processors. To ease vectorization, the data is stored in blocks of $k$ integers using the same number of bits where $k \in \{2, 4\}$. (This approach is similar to binary packing described in § 2.6.) As with regular gamma coding, we use unary codes to store this number of bits though we only have one such number for $k$ integers.
The binary part of the gamma codes is stored using the same vectorized layout as in § 4 (known as vertical or interleaved). During decompression, we decode integers in groups of $k$ integers. For each group we first retrieve the binary length from a gamma code. Then, we decode group elements using a sequence of mask-and-shift operations similar to the fast bit unpacking technique described in § 4. This step does not require branching.
Schlegel et al. report best decoding speeds of 550 mis (2100 MB/s) on synthetic data using an Intel Core i7920 processor (2.67 GHz). These results fall short of our objective to compress billions of integers per second.
2.4 Variable Byte and byte-oriented encodings
Variable Byte is a popular technique [35] that is known under many names (v-byte, variable-byte [36], var-byte, vbyte [30], varint, VInt, VB [12] or Escaping [31]). To our knowledge, it was first described by Thiel and Heaps in 1972 [37]. Variable Byte codes the data in units of bytes: it uses the lower-order seven bits to store the data, while the eighth bit is used as an implicit indicator of a code length. Namely, the eighth bit is equal to 1 only for the last byte of a sequence that encodes an integer. For example:

Integers in $[0, 2^7)$ are written using one byte: The first 7 bits are used to store the binary representation of the integer and the eighth bit is set to 1.

Integers in $[2^7, 2^{14})$ are written using two bytes, the eighth bit of the first byte is set to 0 whereas the eighth bit of the second byte is set to 1. The remaining 14 bits are used to store the binary representation of the integer.
For a concrete example, consider the number 200. It is written as 11001000 in the binary notation. Variable Byte would code it using 16 bits as 10000001 01001000.
When decoding, bytes are read one after the other: we discard the eighth bit if it is zero, and we output a new integer whenever the eighth bit is one.
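The encoding and decoding rules above can be sketched as follows in Python. We assume that the least significant 7-bit group is written first, which is one common convention; the function names are illustrative.

```python
def vbyte_encode(values):
    """Variable Byte: 7 data bits per byte, eighth bit set on the last byte."""
    out = bytearray()
    for v in values:
        while v >= 128:
            out.append(v & 0x7F)       # eighth bit 0: more bytes follow
            v >>= 7
        out.append(v | 0x80)           # eighth bit 1: last byte of this integer
    return bytes(out)

def vbyte_decode(data):
    """Read bytes one by one and emit an integer whenever the eighth bit is 1."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            values.append(current)
            current, shift = 0, 0
        else:
            shift += 7
    return values

encoded_200 = vbyte_encode([200])      # two bytes: 01001000 then 10000001
```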
Though Variable Byte rarely compresses data optimally, it is reasonably efficient. In our tests, Variable Byte encodes data three times faster than most alternatives. Moreover, when the data is not highly compressible, it can match the compression ratios of more parsimonious schemes.
Stepanov et al. [12] generalize Variable Byte into a family of byte-oriented encodings. Their main characteristic is that each encoded byte contains bits from only one integer. However, whereas Variable Byte uses one bit per byte as descriptor, alternative schemes can use other arrangements. For example, varintG8IU [12] and Group Varint [13] (henceforth varintGB) regroup all descriptors in a single byte. Such alternative layouts make it easier to decode several integers simultaneously. A similar approach to placing descriptors in a single control word was used to accelerate a variant of the Lempel-Ziv algorithm [38].
For example, varintGB uses a single byte to describe 4 integers, dedicating 2 bits per integer. The scheme is better explained by an example. Suppose that we want to store the integers $2^{15}$, $2^{23}$, $2^{7}$, and $2^{31}$. In the usual binary notation, we would use 2, 3, 1 and 4 bytes, respectively. We can store the sequence 2, 3, 1, 4 as 1, 2, 0, 3 if we assume that each number is encoded using a nonzero number of bytes. Each one of these 4 integers can be written using 2 bits (as they are in {0,1,2,3}). We can pack them into a single byte containing the bits 01, 10, 00, and 11. Following this byte, we write the integer values using $2 + 3 + 1 + 4 = 10$ bytes.
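A scalar Python sketch of this layout follows. Two details are assumptions made for illustration: the first integer occupies the two most significant bits of the descriptor, and each value is written in little-endian byte order. Actual implementations may order the fields differently.

```python
def varint_gb_encode(four):
    """Encode exactly 4 integers: one descriptor byte, then 1 to 4 bytes per integer."""
    assert len(four) == 4
    descriptor, payload = 0, bytearray()
    for v in four:
        nbytes = max(1, (v.bit_length() + 7) // 8)
        descriptor = (descriptor << 2) | (nbytes - 1)   # store length minus one
        payload += v.to_bytes(nbytes, "little")
    return bytes([descriptor]) + bytes(payload)

def varint_gb_decode(data):
    """Decode one group produced by varint_gb_encode."""
    descriptor, pos, out = data[0], 1, []
    for shift in (6, 4, 2, 0):
        nbytes = ((descriptor >> shift) & 3) + 1
        out.append(int.from_bytes(data[pos:pos + nbytes], "little"))
        pos += nbytes
    return out

group = [2**15, 2**23, 2**7, 2**31]
encoded = varint_gb_encode(group)      # 1 descriptor byte + 10 data bytes
```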
Whereas varintGB codes a fixed number of integers (4) using a single descriptor, varintG8IU uses a single descriptor for a group of 8 bytes, which represent compressed integers. Each 8-byte group may store from 2 to 8 integers. A single-byte descriptor is placed immediately before this 8-byte group. Each bit in the descriptor represents a single data byte. Whenever a descriptor bit is set to 0, then the corresponding byte is the end of an integer. This is symmetrical to the Variable Byte scheme described above, where the descriptor bit value 1 denotes the last byte of an integer code.
In the example we used for varintGB, we could only store the first 3 integers ($2^{15}$, $2^{23}$, $2^{7}$) into a single 8-byte group, because storing all 4 integers would require 10 bytes. These integers use 2, 3, and 1 bytes, respectively, whereas the descriptor byte is equal to 11001101 (in the binary notation). The first two bits (01) of the descriptor tell us that the first integer uses 2 bytes. The next three bits (011) indicate that the second integer requires 3 bytes. Because the third integer uses a single byte, the next (sixth) bit of the descriptor would be 0. In this model, the last two bytes cannot be used and, thus, we would set the last two bits to 1.
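The descriptor of this example can be reproduced with a short Python sketch. We assume that descriptor bit $i$, counted from the least significant bit, describes data byte $i$, which is consistent with the value 11001101 given above; the helper names are hypothetical.

```python
def g8iu_group_lengths(values):
    """Byte lengths of the leading integers that fit together in one 8-byte group."""
    lengths = []
    for v in values:
        nbytes = max(1, (v.bit_length() + 7) // 8)
        if sum(lengths) + nbytes > 8:
            break
        lengths.append(nbytes)
    return lengths

def g8iu_descriptor(byte_lengths):
    """Descriptor byte: a 0 bit marks the last byte of an integer, unused bytes stay 1."""
    assert sum(byte_lengths) <= 8
    descriptor, pos = 0xFF, 0
    for nbytes in byte_lengths:
        pos += nbytes
        descriptor &= ~(1 << (pos - 1)) & 0xFF   # clear the bit of the final byte
    return descriptor

lengths = g8iu_group_lengths([2**15, 2**23, 2**7, 2**31])   # the fourth value does not fit
descriptor = g8iu_descriptor(lengths)
```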
On most recent x86 processors, integers packed with varintG8IU can be efficiently decoded using the SSSE3 (Supplemental Streaming SIMD Extensions 3) shuffle instruction: pshufb. This assembly operation selectively copies byte elements of a 16-element vector to specified locations of the target 16-element buffer and replaces selected elements with zeros.
The name “shuffle” is a misnomer, because certain source bytes can be omitted, while others may be copied multiple times to a number of different locations. The operation takes two 16-element vectors (of $16 \times 8 = 128$ bits each): the first vector contains the bytes to be shuffled into an output vector whereas the second vector serves as a shuffle mask. Each byte in the shuffle mask determines which value will go in the corresponding location in the output vector. If the last bit is set (that is, if the value of the byte is larger than 127), the target byte is zeroed. For example, if every byte of the shuffle mask has its last bit set, then the output vector will contain only zeros. Otherwise, the first 4 bits of the mask element $m_i$ determine the index of the byte that should be copied to the target byte $i$. For example, if the shuffle mask contains the byte values $0, 1, 2, \ldots, 15$, then the bytes are simply copied in their original locations.
In Fig. 4, we illustrate one step of the decoding algorithm for varintG8IU. We assume that the descriptor byte, which encodes the 3 numbers of bytes (2, 3, 1) required to store the 3 integers ($2^{15}$, $2^{23}$, $2^{7}$), is already retrieved. The value of the descriptor byte was used to obtain a proper shuffle mask for pshufb. This mask defines a hard-coded sequence of operations that copy bytes from the source to the target buffer or fill selected bytes of the target buffer with zeros. All these byte operations are carried out in parallel in the following manner (byte numeration starts from zero):

The first integer uses 2 bytes, which are both copied to bytes 0–1 of the target buffer. Bytes 2–3 of the target buffer are zeroed.

Likewise, we copy bytes 2–4 of the source buffer to bytes 4–6 of the target buffer. Byte 7 of the target buffer is zeroed.

The last integer uses only one byte 5: we copy the value of this byte to byte 8 and zero bytes 9–11.

The bytes 12–15 of the target buffer are currently unused and will be filled out by subsequent decoding steps. In the current step, we may fill them with arbitrary values, e.g., zeros.
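The decoding step listed above can be emulated in Python with a software model of pshufb. This is only a model of the instruction's semantics: a real decoder would issue a single SSSE3 instruction and would typically look the mask up in a precomputed table indexed by the descriptor byte. Here we build the mask from the byte lengths directly, and the function names are ours.

```python
import struct

def pshufb(source, mask):
    """Software model of pshufb on 16-byte vectors."""
    return bytes(0 if m & 0x80 else source[m & 0x0F] for m in mask)

def shuffle_mask(byte_lengths):
    """Mask that expands each packed integer to a zero-padded 4-byte slot."""
    mask, src = [], 0
    for nbytes in byte_lengths:
        mask += list(range(src, src + nbytes)) + [0x80] * (4 - nbytes)
        src += nbytes
    return mask + [0x80] * (16 - len(mask))     # unused target bytes are zeroed

values, lengths = [2**15, 2**23, 2**7], [2, 3, 1]
packed = b"".join(v.to_bytes(n, "little") for v, n in zip(values, lengths))
source = packed.ljust(16, b"\x00")              # pad to a full 16-byte vector
target = pshufb(source, shuffle_mask(lengths))
decoded = list(struct.unpack("<4I", target)[:3])
```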
We do not know whether Google implemented varintGB using SIMD instructions [13]. However, Schlegel et al. [34] and Popov [11] described the application of the pshufb instruction to accelerate decoding of a varintGB scheme (which Schlegel et al. called 4-wise null suppression).
Stepanov et al. [12] found varintG8IU to compress slightly better than a SIMD-based varintGB while being up to 20% faster. Compared to the common Variable Byte, varintG8IU had a slightly worse compression ratio (up to 10%), but it is 2–3 times faster.
2.5 The Simple family
Whereas Variable Byte takes a fixed input length (a single integer) and produces a variable-length output (1, 2, 3 or more bytes), at each step the Simple family outputs a fixed number of bits, but processes a variable number of integers, similar to varintG8IU. However, unlike varintG8IU, schemes from the Simple family are not byte-oriented. Therefore, they may fare better on highly compressible arrays (e.g., they could compress a sequence of numbers in $\{0, 1\}$ to about 1 bit/int).
The most competitive Simple scheme on 64-bit processors is Simple8b [25]. It outputs 64-bit words. The first 4 bits of every 64-bit word form a selector that indicates an encoding mode. The remaining 60 bits are employed to keep data. Each integer is stored using the same number of bits $b$. Simple8b has 2 schemes to encode long sequences of zeros and 14 schemes to encode positive integers. For example:

Selector values 0 or 1 represent sequences containing 240 and 120 zeros, respectively. In this instance the 60 data bits are ignored.

The selector value 2 corresponds to $b = 1$. This allows us to store 60 integers having values in {0,1}, which are packed in the data bits.

The selector value 3 corresponds to $b = 2$ and allows one to pack 30 integers having values in {0,1,2,3} in the data bits.
And so on (see Table II): the larger the value of the selector, the larger $b$ is, and the fewer integers one can fit in 60 data bits. During coding, we try successively the selectors starting with value 0. That is, we greedily try to fit as many integers as possible in the next 64-bit word.
selector value  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
integers coded  240  120  60  30  20  15  12  10  8  7  6  5  4  3  2  1
bits per integer  0  0  1  2  3  4  5  6  7  8  10  12  15  20  30  60
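The greedy selection can be sketched in Python using the values of Table II. The placement of the selector in the 4 least significant bits and the order of the packed integers are assumptions made for this illustration, as is the requirement that a selector is only used when enough integers remain to fill the word.

```python
# (integers coded, bits per integer) for selector values 0 to 15, as in Table II
SIMPLE8B = [(240, 0), (120, 0), (60, 1), (30, 2), (20, 3), (15, 4), (12, 5), (10, 6),
            (8, 7), (7, 8), (6, 10), (5, 12), (4, 15), (3, 20), (2, 30), (1, 60)]

def simple8b_encode_word(values, start=0):
    """Greedily pack as many integers as possible into one 64-bit word."""
    for selector, (count, bits) in enumerate(SIMPLE8B):
        chunk = values[start:start + count]
        if len(chunk) == count and all(v < (1 << bits) for v in chunk):
            word = selector
            for i, v in enumerate(chunk):
                word |= v << (4 + i * bits)
            return word, count
    raise ValueError("no selector fits the remaining integers")

def simple8b_decode_word(word):
    """Recover the integers stored in one word."""
    count, bits = SIMPLE8B[word & 0xF]
    mask = (1 << bits) - 1
    return [(word >> (4 + i * bits)) & mask for i in range(count)]

small = [3, 1, 0, 2] * 8                     # 32 integers that fit in 2 bits each
word, consumed = simple8b_encode_word(small)
```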
Other schemes such as Simple9 [9] and Simple16 [10] use words of 32 bits. (Simple9 and Simple16 can also be written as S9 and S16 [10].) While these schemes may sometimes compress slightly better, they are generally slower. Hence, we omitted them in our experiments. Unlike Simple8b that can encode integers in $[0, 2^{60})$, Simple9 and Simple16 are restricted to integers in $[0, 2^{28})$.
While Simple8b is not as fast as Variable Byte during encoding, it is still faster than many alternatives. Because the decoding step can be implemented efficiently (with little branching), we also get a good decoding speed while achieving a better compression ratio than Variable Byte.
2.6 Binary Packing
Binary packing is closely related to Frame-Of-Reference (FOR) from Goldstein et al. [39] and tuple differential coding from Ng and Ravishankar [40]. In such techniques, arrays of values are partitioned into blocks (e.g., of 128 integers). The range of values in the blocks is first coded and then all values in the block are written in reference to the range of values: for example, if the values in a block are integers in the range $[1000, 1127]$, then they can be stored using 7 bits per integer ($\lceil \log_2 (1127 + 1 - 1000) \rceil = 7$) as offsets from the number 1000 stored in the binary notation. In our approach to binary packing, we assume that integers are small, so we only need to code a bit width $b$ per block (to represent the range). Then, successive values are stored using $b$ bits per integer using fast bit packing functions. Anh and Moffat called binary packing PackedBinary [25] whereas Delbru et al. [41] called their 128-integer binary packing FOR and their 32-integer binary packing AFOR1.
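The two ways of choosing a bit width for a block can be compared with a short Python sketch; the function names are ours.

```python
def for_bit_width(block):
    """Bits per value when storing offsets from the block minimum (frame of reference)."""
    return (max(block) - min(block)).bit_length()

def binary_packing_bit_width(block):
    """Bits per value when the values are assumed small and stored directly."""
    return max(block).bit_length()

block = list(range(1000, 1128))               # 128 values in [1000, 1127]
offsets_width = for_bit_width(block)          # offsets from 1000 lie in [0, 127]
direct_width = binary_packing_bit_width(block)
```

With sorted data, differential coding usually makes the values small before packing, which is why coding only a bit width per block can be sufficient.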
Binary packing can have a competitive compression ratio. In Appendix A, we derive a general information-theoretic lower bound on the compression ratio of binary packing.
2.7 Binary Packing with variable-length blocks
Three factors determine the storage cost of a given block in binary packing:

the number of bits ($b$) required to store the largest integer value in the binary notation,

the block length ($B$),

and a fixed per-block overhead ($\kappa$).
The total storage cost for one block is $bB + \kappa$. Binary packing uses fixed-length blocks (e.g., $B = 32$ or $B = 128$).
We can vary dynamically the length of the blocks to improve the compression ratio. This adds a small overhead to each block because we need to store not only the corresponding bit width ($b$) but also the block length ($B$). We then have a conventional optimization problem: we must partition the array into blocks so that the total storage cost is minimized. The cost of each block is still given by $bB + \kappa$, but the block length $B$ may vary from one block to another.
The dynamic selection of block length was first proposed by Deveaux et al. [42] who reported compression gains (15–30%). They used both a topdown and a bottomup heuristic.
Delbru et al. [41] also implemented two such adaptive solutions, AFOR2 and AFOR3. AFOR2 picks blocks of length 8, 16, 32 whereas AFOR3 adds a special processing for the case where we have successive integers. To determine the best configuration of blocks, they pick 32 integers and try various configurations (1 block of 32 integers, 2 blocks of 16 integers and so on). They keep the configuration minimizing the storage cost. In effect, they apply a greedy approach to the storage minimization problem.
Silvestri and Venturini [24] proposed two variable-length schemes, and we selected their fastest version (henceforth VSEncoding). VSEncoding optimizes the block length using dynamic programming over blocks of lengths 1–14, 16, 32. That is, given the integer logarithm of every integer in the array, VSEncoding finds a partition truly minimizing the total storage cost. We expect VSEncoding to provide a superior compression ratio compared to AFOR2 and AFOR3.
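The optimization problem can be illustrated with a generic dynamic program in Python. This is a sketch of the idea and not the VSEncoding implementation: the per-block overhead `kappa` is a hypothetical value, and the cost model is simply $bB + \kappa$ per block with the allowed block lengths quoted above.

```python
def optimal_partition(values, lengths=tuple(range(1, 15)) + (16, 32), kappa=8):
    """Partition values into blocks of allowed lengths, minimizing the sum of b*B + kappa."""
    n = len(values)
    widths = [v.bit_length() for v in values]     # integer logarithm of every value
    best = [0] + [float("inf")] * n               # best[i]: minimal cost of the first i values
    cut = [0] * (n + 1)                           # cut[i]: length of the last block in that optimum
    for i in range(1, n + 1):
        for B in lengths:                         # lengths must be sorted and contain 1
            if B > i:
                break
            cost = best[i - B] + max(widths[i - B:i]) * B + kappa
            if cost < best[i]:
                best[i], cut[i] = cost, B
    blocks, i = [], n
    while i > 0:                                  # walk back through the chosen cuts
        blocks.append(cut[i])
        i -= cut[i]
    return best[n], blocks[::-1]

values = [1] * 14 + [1000] + [1] * 14
total_cost, block_lengths = optimal_partition(values)
```

In this toy input, isolating the single large value in its own block avoids paying its bit width for its neighbors.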
2.8 Patched coding
Binary packing might sometimes compress poorly. For example, the integers $1, 4, 255, 4, 3, 12, 101$ can be stored using slightly more than 8 bits per integer with binary packing. However, the same sequence with one large value, e.g., $1, 4, 255, 4, 3, 12, 4294967295$, is no longer so compressible: at least 32 bits per integer are required. Indeed, 32 bits are required for storing $4294967295$ in the binary notation and all integers use the same bit width under binary packing.
To alleviate this problem, Zukowski et al. [23] proposed patching: we use a small bit width $b$, but store exceptions (values greater than or equal to $2^b$) in a separate location. They called this approach PFOR. (It is sometimes written PFD [43], PFor or PForDelta when used in conjunction with differential coding.) We begin with a partition of the input array into subarrays that have a fixed maximal size (e.g., 32 MB). We call each such subarray a page.
A single bit width is used for an entire page in PFOR. To determine the best bit width during encoding, a sample of at most $2^{16}$ integers is created out of the page. Then, various bit widths are tested until the best compression ratio is achieved. In practice, to accelerate the computation, we can construct a histogram, recording how many integers have a given integer logarithm ($\lceil \log_2 (x+1) \rceil$).
A page is coded in blocks of 128 integers, with a separate storage array for the exceptions. The blocks are coded using bit packing. We either pack the integer value itself when the value is regular ($< 2^b$), or an integer offset pointing to the next exception in the block of 128 integers when there is one. The offset is the difference between the index of the next exception and the index of the current exception, minus one. For the purpose of bit packing, we store integer values and offsets in the same array without differentiating them. For example, consider the following array of integers in the binary notation:
Assume that the bit width is set to three (), then we have exceptions at positions , the offsets are In the binary notation we have and , so we would store
The bit-packed blocks are preceded by a 32-bit word containing two markers. The first marker indicates the location of the first exception in the block of 128 integers (4 in our example), and the second marker indicates the location of this first exception value in the array of exceptions (exception table).
Effectively, exception locations are stored using a linked list: we first read the location of the first exception, then going to this location we find an offset from which we retrieve the location of the next exception, and so on. If the bit width $b$ is too small to store an offset value, that is, if the offset is greater than or equal to $2^b$, we have to create a compulsory exception in between. The locations of the exception values themselves are found by incrementing the location of the first exception value in the exception table.
When there are too many exceptions, these exception tables may overflow and it is necessary to start a new page: Zukowski et al. [23] used pages of 32 MB. In our own experiments, we partition large arrays into arrays of at most $2^{16}$ integers (see § 6.2) so a single page is used in practice.
PFOR [23] does not compress the exception values. In an attempt to improve the compression, Zhang et al. [26] proposed to store the exception values using either 8, 16, or 32 bits. We implemented this approach (henceforth PFOR2008). (See Table III.)
2.8.1 NewPFD and OptPFD
The compression ratios of PFOR and PFOR2008 are relatively modest (see § 6). For example, we found that they fare worse than binary packing over blocks of 32 integers (BP32). To get better compression, Yan et al. [10] proposed two new schemes called NewPFD and OptPFD. (NewPFD is sometimes called NewPFOR [44, 45] whereas OptPFD is also known as OPTP4D [24].) Instead of using a single bit width per page, they use a bit width per block of 128 integers. They avoid wasteful compulsory exceptions: instead of storing exception offsets in the bit packed blocks, they store the first $b$ bits of the exceptional integer value. For example, given the following array
and a bit width of 3 (), we would pack
For each block of 128 integers, the higher bits of the exception values ( in our example) as well as their locations (e.g., ) are compressed using Simple16. (We tried replacing Simple16 with Simple8b but we found no benefit.)
Each block of 128 coded integers is preceded by a 32-bit word used to store the bit width, the number of exceptions, and the storage requirement of the compressed exception values in 32-bit words. NewPFD determines the bit width $b$ by picking the smallest value of $b$ such that not more than 10% of the integers are exceptions. OptPFD picks the value of $b$ maximizing the compression. To accelerate the processing, the bit width is chosen among the integer values 0–16, 20 and 32.
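The NewPFD rule for the bit width, and the split of each exception into low and high bits, can be sketched in Python. The sketch omits the actual storage (bit packing of the low bits, Simple16 coding of the high bits and locations) and only shows how a block is separated; the function names are ours.

```python
CANDIDATE_WIDTHS = list(range(17)) + [20, 32]

def newpfd_bit_width(block, max_exception_ratio=0.10):
    """Smallest candidate b such that at most 10% of the block are exceptions."""
    for b in CANDIDATE_WIDTHS:
        exceptions = sum(1 for v in block if v >= (1 << b))
        if exceptions <= max_exception_ratio * len(block):
            return b
    return 32

def newpfd_split(block, b):
    """Low b bits of every value, plus locations and high bits of the exceptions."""
    low = [v & ((1 << b) - 1) for v in block]
    positions = [i for i, v in enumerate(block) if v >= (1 << b)]
    high = [block[i] >> b for i in positions]
    return low, positions, high

def newpfd_merge(low, positions, high, b):
    """Patch the high bits back into the exceptional values."""
    out = list(low)
    for i, h in zip(positions, high):
        out[i] |= h << b
    return out

block = [1, 2, 3, 4] * 31 + [100, 5, 6, 7]   # 128 integers with one large value
b = newpfd_bit_width(block)
low, positions, high = newpfd_split(block, b)
```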
Scheme  Compulsory exceptions  Bit width  Exceptions  Compressed exceptions
PFOR [23]  yes  per page  per page  no
PFOR2008 [26]  yes  per page  per page  8, 16, 32 bits
NewPFD/OptPFD [10]  no  per block  per block  Simple16
FastPFOR (§ 5)  no  per block  per page  binary packing
SIMDFastPFOR (§ 5)  no  per block  per page  vectorized bin. pack.
SimplePFOR (§ 5)  no  per block  per page  Simple8b
Ao et al. [43] also proposed a version of PFOR called ParaPFD. Though it has a worse compression efficiency than NewPFD or PFOR, it is designed for fast execution on graphics processing units (GPUs).
3 Fast differential coding and decoding
All of the schemes we consider experimentally rely on differential coding over 32-bit unsigned integers. The computation of the differences (or deltas) is typically considered a trivial operation which accounts for only a negligible fraction of the total decoding time. Consequently, authors do not discuss it. But in our experience, a straightforward implementation of differential decoding can be four times slower than the decompression of small integers.
We have implemented and evaluated two approaches to data differencing:

The standard form of differential coding is simple and requires merely one subtraction per value during encoding ($\delta_i = x_i - x_{i-1}$) and one addition per value during decoding to effectively compute the prefix sum ($x_i = \delta_i + x_{i-1}$).

A vectorized differential coding leaves the first four elements unmodified. From each of the remaining elements with index $i$, we subtract the element with the index $i-4$: $\delta_i = x_i - x_{i-4}$. In other words, the original array ($x_1, x_2, \ldots$) is converted into ($x_1, x_2, x_3, x_4, \delta_5 = x_5 - x_1, \delta_6 = x_6 - x_2, \delta_7 = x_7 - x_3, \delta_8 = x_8 - x_4, \ldots$). An advantage of this approach is that we can compute four differences using a single SIMD operation. This operation carries out an element-wise subtraction for two four-element vectors. The decoding part is symmetric and involves the addition of the element $x_{i-4}$: $x_i = \delta_i + x_{i-4}$. Again, we can use a single SIMD instruction to carry out four additions simultaneously.
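The vectorized variant can be modeled in plain Python, where each group of four additions stands in for one SIMD instruction on four-element vectors. This is a scalar model of the data flow only; it does not use SIMD and says nothing about speed.

```python
def delta4_encode(x):
    """Leave the first four values, then subtract the value four positions earlier."""
    return x[:4] + [x[i] - x[i - 4] for i in range(4, len(x))]

def delta4_decode(d):
    """Add the previous (already decoded) group of four to the current group."""
    x = list(d)
    for i in range(4, len(x), 4):
        # one iteration models a single four-lane SIMD addition
        x[i:i + 4] = [cur + prev for cur, prev in zip(x[i:i + 4], x[i - 4:i])]
    return x

sorted_values = [1, 3, 4, 9, 10, 15, 17, 20, 22]
deltas4 = delta4_encode(sorted_values)
```

Because each delta spans four positions instead of one, the deltas tend to be larger than with the standard form, which is the storage price discussed next.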
We can get a speed of 2000 mis or 1.7 cycles/int with the standard differential decoding (the first approach) by manually unrolling the loops. Clearly, it is impossible to decode compressed integers at more than 2 billion integers per second if the computation of the prefix sum itself runs at 2 billion integers per second. Hence, we implemented a vectorized version of differential coding. Vectorized differential decoding is much faster (5000 mis vs. 2000 mis). However, it comes at a price: vectorized deltas are, on average, four times larger which increases the storage cost by up to 2 bits (e.g., see Table V).
To prevent memory bandwidth from becoming a bottleneck [1, 2, 3, 4, 5], we prefer to compute differential coding and decoding in place. To this end, we compute deltas in decreasing index order, starting from the largest index. For example, given the integers 1, 4, 13, we first compute the difference between 13 and 4 which we store in last position (1, 4, 9), then we compute the difference between 4 and 1 which we store in second position (1, 3, 9). In contrast, the differential decoding proceeds in the increasing index order, starting from the beginning of the array. Starting from 1, 3, 9, we first add 1 and 3 which we store in the second position (1, 4, 9), then we add 4 and 9 which we store in the last position (1, 4, 13). Further, our implementation requires two passes: one pass to reconstruct the deltas from their compressed format and another pass to compute the prefix sum (§ 6.2). To improve data locality and reduce cache misses, arrays containing more than $2^{16}$ integers (256 KB) are broken down into smaller arrays and each array is decompressed independently. Experiments with synthetic data have shown that reducing cache misses by breaking down arrays can lead to a significant improvement in decoding speed for some schemes without degrading the compression efficiency.
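The in-place scheme can be sketched in scalar C++ as follows. This is a minimal illustration, not the tuned implementation evaluated in the paper: the function names are hypothetical, and the stride-4 variant is written as a plain loop (a compiler may or may not turn it into SIMD instructions).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Standard differential coding, in place: data[i] becomes data[i] - data[i-1].
// We walk from the largest index down so that every subtraction still sees
// the original (not yet overwritten) predecessor.
void delta_encode(uint32_t* data, size_t n) {
  assert(data != nullptr || n == 0);
  for (size_t i = n; i-- > 1;) data[i] -= data[i - 1];
}

// Inverse operation: an in-place prefix sum in increasing index order.
void delta_decode(uint32_t* data, size_t n) {
  assert(data != nullptr || n == 0);
  for (size_t i = 1; i < n; ++i) data[i] += data[i - 1];
}

// Stride-4 variant: the first four values are left unmodified and every
// other value is replaced by its difference with the value 4 positions back.
void delta4_encode(uint32_t* data, size_t n) {
  assert(data != nullptr || n == 0);
  for (size_t i = n; i-- > 4;) data[i] -= data[i - 4];
}

// Inverse of delta4_encode: each value adds the decoded value 4 positions back.
void delta4_decode(uint32_t* data, size_t n) {
  assert(data != nullptr || n == 0);
  for (size_t i = 4; i < n; ++i) data[i] += data[i - 4];
}
```

Applied to the array 1, 4, 13, `delta_encode` produces 1, 3, 9 and `delta_decode` restores 1, 4, 13.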
4 Fast bit unpacking
Bit packing is a process of encoding small integers in $[0, 2^b)$ using $b$ bits each: $b$ can be arbitrary and not just 8, 16, 32 or 64. Each number is written using a string of exactly $b$ bits. Bit strings of fixed size $b$ are concatenated together into a single bit string, which can span several 32-bit words. If some integer is too small to use all $b$ bits, it is padded with zeros.
Languages like C and C++ support the concept of bit packing through bit fields. An example of two C/C++ structures with bit fields is given in Fig. 5. Each structure in this example stores 8 small integers. The structure Fields4_8 uses 4 bits per integer ($b = 4$), while the structure Fields5_8 uses 5 bits per integer ($b = 5$).
Assuming that bit fields in these structures are stored compactly, i.e., without gaps, and the order of the bit fields is preserved, the 8 integers are stored in the memory as shown in Fig. 6. If any bits remain unused, their values can be arbitrary. All small integers on the left panel in Fig. 6 fit into a single 32-bit word. However, the integers on the right panel require two 32-bit words with 24 bits remaining unused (these bits can be arbitrary). The field of the 7th integer crosses the 32-bit word boundary: the first 2 bits use bits 30–31 of the first word, while the remaining 3 bits occupy bits 0–2 of the second word (bits are enumerated starting from zero).
Unfortunately, language implementers are not required to ensure that the data is fully packed. For example, the C language specification states that whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined [46]. Most importantly, they do not have to provide packing and unpacking routines that are optimally fast. Hence, we implemented bit packing and unpacking using our own procedures as proposed by Zukowski et al. [23]. In Fig. 7, we give C/C++ implementations of such procedures assuming that fields are laid out as depicted in Fig. 6. The packing procedures can be implemented similarly and we omit them for simplicity of exposition.
In some cases, we use bit packing even though some integers are larger than $2^b - 1$ (see § 2.8). In effect, we want to pack only the first $b$ bits of each integer, which can be implemented by applying a bitwise logical and operation with the mask $2^b - 1$ on each integer. These extra steps slow down the bit packing (see § 6.3).
The procedure unpack4_8 decodes eight 4-bit integers. Because these integers are tightly packed, they occupy exactly one 32-bit word. Given that this word is already loaded in a register, each integer can be extracted using at most four simple operations (shift, mask, store, and pointer increment). Unpacking is efficient because it does not involve any branching.
The procedure unpack5_8 decodes eight 5-bit integers. This case is more complicated, because the packed representation uses two words: the field for the 7th integer crosses word boundaries. The first two (lower order) bits of this integer are stored in the first word, while the remaining three (higher order) bits are stored in the second word. Decoding does not involve any branches and most integers are extracted using four simple operations.
The procedures unpack4_8 and unpack5_8 are merely examples. Separate procedures are required for each bit width (not just 4 and 5).
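As an illustration of the two procedures described above, here is a minimal scalar sketch. It is not a reproduction of the listing in Fig. 7: it only assumes the compact layout described in the text (fields stored from the low-order bits upward), and `pack5_8` is a hypothetical helper added so that the round trip can be checked. The packer applies the mask $2^5 - 1$ so that oversized inputs are truncated to their 5 lowest bits.

```cpp
#include <cassert>
#include <cstdint>

// Decodes eight 4-bit integers stored in a single 32-bit word.
void unpack4_8(const uint32_t* in, uint32_t* out) {
  assert(in != nullptr && out != nullptr);
  const uint32_t w = in[0];
  out[0] = w & 15;
  out[1] = (w >> 4) & 15;
  out[2] = (w >> 8) & 15;
  out[3] = (w >> 12) & 15;
  out[4] = (w >> 16) & 15;
  out[5] = (w >> 20) & 15;
  out[6] = (w >> 24) & 15;
  out[7] = w >> 28;
}

// Decodes eight 5-bit integers stored in two 32-bit words (24 bits unused).
void unpack5_8(const uint32_t* in, uint32_t* out) {
  assert(in != nullptr && out != nullptr);
  const uint32_t w0 = in[0], w1 = in[1];
  out[0] = w0 & 31;
  out[1] = (w0 >> 5) & 31;
  out[2] = (w0 >> 10) & 31;
  out[3] = (w0 >> 15) & 31;
  out[4] = (w0 >> 20) & 31;
  out[5] = (w0 >> 25) & 31;
  // The 7th integer straddles the words: its 2 low bits are bits 30-31
  // of w0 and its 3 high bits are bits 0-2 of w1.
  out[6] = (w0 >> 30) | ((w1 & 7) << 2);
  out[7] = (w1 >> 3) & 31;
}

// Hypothetical matching packer; inputs are masked to their 5 lowest bits.
void pack5_8(const uint32_t* in, uint32_t* out) {
  assert(in != nullptr && out != nullptr);
  const uint32_t mask = (1u << 5) - 1;
  uint32_t w0 = 0;
  for (int k = 0; k < 6; ++k) w0 |= (in[k] & mask) << (5 * k);
  w0 |= (in[6] & mask) << 30;  // only the 2 low bits survive the shift
  out[0] = w0;
  out[1] = ((in[6] & mask) >> 2) | ((in[7] & mask) << 3);
}
```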
Decoding routines unpack4_8 and unpack5_8 operate on scalar 32-bit values. An effective way to improve performance of these routines involves vectorization [14, 47]. Consider listings in Fig. 7 and assume that in and out are pointers to $m$-element vectors instead of scalars. Further, assume that scalar operators (shifts, assignments, and bitwise logical operations) are vectorized. For example, a bitwise shift is applied to all $m$ vector elements at the same time. Then, a single call to unpack5_8 or unpack4_8 decodes $m \times 8$ rather than just eight integers.
Recent x86 processors have SIMD instructions that operate on vectors of four 32-bit integers ($m = 4$) [48, 49, 50]. We can use these instructions to achieve a better decoding speed. A sample vectorized data layout for $b = 5$ is given in Fig. 8. Integers are divided among series of four 32-bit words in a round-robin fashion. When a series of four words overflows, the data spills over to the next series of 32-bit integers. In this example, the first 24 integers are stored in the first four words (the first row in Fig. 8), integers 25–28 are each split between different words, and the remaining integers 29–32 are stored in the second series of words (the second row of Fig. 8).
These data can be processed using a vectorized version of the procedure unpack5_8, which is obtained from unpack5_8 by replacing scalar operations with respective SIMD instructions. With Microsoft, Intel or GNU GCC compilers, we can almost mechanically go from the scalar procedure to the vectorized one by replacing each C operator with the equivalent SSE2 intrinsic function:

the bitwise logical and (&) becomes _mm_and_si128,

the right shift (>>) becomes _mm_srli_epi32,

and the left shift (<<) becomes _mm_slli_epi32.
Indeed, compare procedure unpack5_8 from Fig. 7 with procedure SIMDunpack5_8 from Fig. 9. The intrinsic functions serve the same purpose as the C operators except that they work on vectors of 4 integers instead of single integers: e.g., the function _mm_srli_epi32 shifts 4 integers at once. The functions _mm_load_si128 and _mm_store_si128 load a register from memory and write the content of a register to memory respectively; the function _mm_set1_epi32 creates a vector of 4 integers initialized with a single integer (e.g., 31 becomes 31,31,31,31).
In the beginning of the vectorized procedure the pointer in points to the first 128-bit chunk of data displayed in row one of Fig. 8. The first shift-and-mask operation extracts 4 small integers at once. Then, these integers are written to the target buffer using a single 128-bit SIMD store operation. The shift-and-mask is repeated until we extract the first 24 numbers and the first two bits of the integers 25–28. At this point the unpack procedure increases the pointer in and loads the next 128-bit chunk into a register. Using an additional mask operation, it extracts the remaining 3 bits of integers 25–28. These bits are combined with already obtained first 2 bits (for each of the integers 25–28). Finally, we store integers 25–28 and finish processing the second 128-bit chunk by extracting numbers 29–32.
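The walk-through above can be sketched with SSE2 intrinsics as follows. This is an illustrative sketch for $b = 5$ and 32 integers, not the authors' SIMDunpack5_8 listing. The names `simd_unpack5_32` and `pack5_32_vertical` are hypothetical, the packer is a plain scalar reference added only to make the example checkable, and unaligned loads and stores (`_mm_loadu_si128`, `_mm_storeu_si128`) are used so that the buffers need no 16-byte alignment.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Scalar reference packer for the interleaved (vertical) layout: integer i
// goes to lane i % 4, at bit offset 5 * (i / 4) of that lane's bit stream.
// The output is two series of four 32-bit words (8 words in total).
void pack5_32_vertical(const uint32_t* in, uint32_t* out) {
  for (int w = 0; w < 8; ++w) out[w] = 0;
  for (int i = 0; i < 32; ++i) {
    const uint32_t v = in[i] & 31;
    const int lane = i % 4, bit = 5 * (i / 4);
    const int word = bit / 32, shift = bit % 32;
    out[4 * word + lane] |= v << shift;
    // A field starting after bit 27 spills into the next series of words.
    if (shift > 27) out[4 * (word + 1) + lane] |= v >> (32 - shift);
  }
}

// Unpacks 32 five-bit integers, four at a time, from the vertical layout.
void simd_unpack5_32(const uint32_t* in, uint32_t* out) {
  assert(in != nullptr && out != nullptr);
  const __m128i mask = _mm_set1_epi32(31);
  const __m128i* src = reinterpret_cast<const __m128i*>(in);
  __m128i* dst = reinterpret_cast<__m128i*>(out);
  __m128i w = _mm_loadu_si128(src);  // first 128-bit chunk
  _mm_storeu_si128(dst + 0, _mm_and_si128(w, mask));
  _mm_storeu_si128(dst + 1, _mm_and_si128(_mm_srli_epi32(w, 5), mask));
  _mm_storeu_si128(dst + 2, _mm_and_si128(_mm_srli_epi32(w, 10), mask));
  _mm_storeu_si128(dst + 3, _mm_and_si128(_mm_srli_epi32(w, 15), mask));
  _mm_storeu_si128(dst + 4, _mm_and_si128(_mm_srli_epi32(w, 20), mask));
  _mm_storeu_si128(dst + 5, _mm_and_si128(_mm_srli_epi32(w, 25), mask));
  const __m128i low = _mm_srli_epi32(w, 30);  // 2 low bits of integers 25-28
  w = _mm_loadu_si128(src + 1);               // second 128-bit chunk
  // Move bits 0-2 of the new words to positions 2-4 and drop everything else.
  const __m128i high = _mm_and_si128(_mm_slli_epi32(w, 2), mask);
  _mm_storeu_si128(dst + 6, _mm_or_si128(low, high));
  _mm_storeu_si128(dst + 7, _mm_and_si128(_mm_srli_epi32(w, 3), mask));
}
```

Each store writes four consecutive output integers because, in this layout, integers $4k+1, \ldots, 4k+4$ occupy the same bit positions of the four lanes.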
Our vectorized data layout is interleaved. That is, the first four integers (Int 1, Int 2, Int 3, and Int 4 in Fig. 8) are packed into 4 different 32-bit words. The first integer is immediately adjacent to the fifth integer (Int 5). Schlegel et al. [34] called this model vertical. Instead we could ensure that the integers are packed sequentially (e.g., Int 1, Int 2, and Int 3 could be stored in the same 32-bit word). Schlegel et al. called this alternative model horizontal and it is used by Willhalm et al. [47]. In their scheme, decoding relies on the SSSE3 shuffle operation pshufb (like varintG8IU). After we determine the bit width $b$ of integers in the block, one decoding step typically includes the following operations:

Loading data into the source 16-byte buffer (this step may require a 16-byte alignment).

Distributing 3–4 integers stored in the source buffer among four 32-bit words of the target buffer. This step, which requires loading a shuffle mask, is illustrated by Fig. 10 (for 5-bit integers). Note that unlike varintG8IU, the integers in the source buffer are not necessarily aligned on byte boundaries (unless $b$ is 8, 16, or 32). Hence, after the shuffle operation, (1) the integers copied to the target buffer may not be aligned on boundaries of 32-bit words, and (2) 32-bit words may contain some extra bits that do not belong to the integers of interest.

Aligning integers on bit boundaries, which may require shifting several integers to the right. Because the x86 platform currently lacks a SIMD shift that has four different shift amounts, this step is simulated via two operations: a SIMD multiplication by four different integers using the SSE4.1 instruction pmulld and a subsequent vectorized right shift.

Zeroing bits that do not belong to the integers of interest. This requires a mask operation.

Storing the target buffer.
Overall, Willhalm et al. [47] require SSE4.1 for their horizontal bit packing whereas efficient bit packing using a vertical layout only requires SSE2.
We compare vertical and horizontal bit packing experimentally in § 6.3.
5 Novel schemes: SIMDFastPFOR, FastPFOR and SimplePFOR
Patched schemes compress arrays broken down into pages (e.g., thousands or millions of integers). Pages themselves may be broken down into small blocks (e.g., 128 integers). While the original patched coding scheme (PFOR) stores exceptions on a per page basis, newer alternatives such as NewPFD and OptPFD store exceptions on a per block basis (see Table III). Also, PFOR picks a single bit width for an entire page, whereas NewPFD and OptPFD may choose a separate bit width for each block.
The net result is that NewPFD compresses better than PFOR, but PFOR is faster than NewPFD. We would prefer a scheme that compresses as well as NewPFD but with the speed of PFOR. For this purpose, we propose two new schemes: FastPFOR and SimplePFOR. Instead of compressing the exceptions on a per block basis like NewPFD and OptPFD, FastPFOR and SimplePFOR store the exceptions on a per page basis, which is similar to PFOR. However, like NewPFD and OptPFD, they pick a new bit width for each block.
To explain FastPFOR and SimplePFOR, we consider an example. For simplicity, we only use 16 integers (to be encoded). In the binary notation these numbers are: 10, 10, 1, 10, 100110, 10, 1, 11, 10, 100000, 10, 110100, 10, 11, 11, 1.
The maximal number of bits used by an integer is 6 (e.g., because of 100000). So we can store the data using 6 bits per value plus some overhead. However, we might be able to do better by allowing exceptions in the spirit of patched coding. Assume that we store the location of any exception using a byte (8 bits): in our implementation, we use blocks of 128 integers so that this is not a wasteful choice.
We want to pick $b \leq 6$, the actual number of bits we use. That is, we store the lowest $b$ bits of each value. If a value uses 6 bits, then we somehow need to store the extra $6 - b$ bits as an exception. We propose to use the difference (i.e., $6 - b$) between the maximal bit width and the number of bits allocated per truncated integer to estimate the cost of storing an exception. This is a heuristic since we use slightly more in practice (to compress the highest bits of exception values). Because we store exception locations using 8 bits, we estimate the cost of storing each exception as $8 + (6 - b) = 14 - b$ bits. We want to choose $b$ so that $b \times 16 + (14 - b) \times c$ is minimized where $c$ is the number of exceptions corresponding to the value $b$. (In our software, we store blocks of 128 integers so that the formula would be $b \times 128 + (14 - b) \times c$.)
We still need to compute the number of exceptions $c$ as a function of the bit width $b$ in a given block of integers. For this purpose, we build a histogram that tells us how many integers have a given bit width. In software, this can be implemented as an array of 33 integers: one integer for each possible bit width from 0 to 32. Creating the histogram requires the computation of the integer logarithm ($\lceil \log_2 (x + 1) \rceil$) of every single integer to be coded. From this histogram, we can quickly determine the value $b$ that minimizes the expected storage simply by trying every possible value of $b$. Looking at our data, we have 3 integers using 1 bit, 10 integers using 2 bits, and 3 integers using 6 bits. So, if we set $b = 1$, we get $c = 13$ exceptions; for $b = 2$, we get $c = 3$; and for $b = 6$, $c = 0$. The corresponding costs ($b \times 16 + (14 - b) \times c$) are 185, 68, and 96. So, in this case, we choose $b = 2$. We therefore have 3 exceptions (100110, 100000, 110100).
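The selection procedure can be sketched as follows: build the 33-entry histogram of bit widths, then try every candidate width while accumulating the number of exceptions. This is a minimal sketch under the cost heuristic stated above (8 bits per exception location plus the truncated high bits). The names `bit_width_of`, `BlockChoice` and `choose_bit_width` are hypothetical, and the integer logarithm is computed with a simple loop rather than a hardware instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bit width of x, i.e., ceil(log2(x + 1)); the width of 0 is 0.
inline uint32_t bit_width_of(uint32_t x) {
  uint32_t bits = 0;
  while (x != 0) {
    ++bits;
    x >>= 1;
  }
  return bits;
}

struct BlockChoice {
  uint32_t b;           // bits kept per truncated value
  uint32_t maxbits;     // largest bit width found in the block
  uint32_t exceptions;  // number of values wider than b bits
  uint32_t cost;        // estimated storage cost in bits
};

BlockChoice choose_bit_width(const uint32_t* block, size_t n) {
  assert(block != nullptr && n > 0);
  const uint32_t len = static_cast<uint32_t>(n);
  uint32_t hist[33] = {0};  // hist[w] = number of values of bit width w
  for (size_t i = 0; i < n; ++i) ++hist[bit_width_of(block[i])];
  uint32_t maxbits = 32;
  while (maxbits > 0 && hist[maxbits] == 0) --maxbits;
  // Start from b = maxbits (no exception) and try every smaller width.
  BlockChoice best = {maxbits, maxbits, 0, maxbits * len};
  uint32_t c = 0;
  for (uint32_t b = maxbits; b-- > 0;) {
    c += hist[b + 1];  // values of width b + 1 become exceptions as well
    const uint32_t cost = b * len + c * (8 + maxbits - b);
    if (cost < best.cost) best = BlockChoice{b, maxbits, c, cost};
  }
  return best;
}
```

On the 16 example integers, the candidates $b = 1$, $b = 2$ and $b = 6$ evaluate to 185, 68 and 96 bits, so the sketch returns $b = 2$ with 3 exceptions.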
A compressed page begins with a 32-bit integer. Initially, this 32-bit integer is left uninitialized: we come back to it later. Next, we store the values themselves, with the restriction that we use only the $b$ lowest bits of each value. In our example, the data corresponding to the block is 10, 10, 01, 10, 10, 10, 01, 11, 10, 00, 10, 00, 10, 11, 11, 01.
These truncated values are stored continuously, one block after the other (see Fig. 11). Different blocks may use different values of $b$, but because $128 \times b$ is always divisible by 32, the truncated values for a given block can be stored at a memory address that is 32-bit aligned.
During the encoding of a compressed page, we write to a temporary byte array. The byte array contains different types of information. For each block, we store the number of bits allocated for each truncated integer (i.e., $b$) and the maximum number of bits any actual, i.e., non-truncated, value may use. If the maximal bit width is greater than the number of allocated bits $b$, we store a counter $c$ indicating the number of exceptions. We also store the exception locations within the block as integers in $[0, 127]$. In contrast with schemes such as NewPFD or OptPFD, we do not attempt to compress these numbers and simply store them using one byte each. Each value is already represented concisely using only one byte as opposed to using a 32-bit integer or worse.
When all integers of a page have been processed and bit-packed, the temporary byte array is stored right after the truncated integers, preceded with a 32-bit counter indicating its size. We pad the byte array with zeros so that the number of bytes is divisible by 4 (allowing a 32-bit memory alignment). Then, we go back to the beginning of the compressed page, where we had left an uninitialized 32-bit integer, and we write there the offset of the byte array within the compressed page. This ensures that during decoding we can locate the byte array immediately. The initial 32-bit integer and the 32-bit counter preceding the byte array add a fixed overhead of 8 bytes per page: it is typically negligible because it is shared by many blocks, often spanning thousands of integers.
In our example, we write 16 truncated integers using $b = 2$ bits each, for a total of 4 bytes (32 bits). In the byte array, we store:

the number of bits ($b = 2$) allocated per truncated integer using one byte;

the maximal bit width (6) using a byte;

the number of exceptions (again using one byte);

locations of the exceptions (4,9,11) using one byte each.
Thus, for this block alone, we use 3+4+3=10 bytes (80 bits): 4 bytes of truncated data and 6 bytes in the byte array.
Finally, we must store the four highest bits of each exception: 1001, 1000, 1101. They are stored right after the byte array. Because the offset of the byte array within the page is stored at the beginning of the page, and because the byte array is stored with a header indicating its length, we can locate the exceptions quickly during decoding. The exceptions are stored on a per-page basis in compressed form. This is in contrast to schemes such as OptPFD and NewPFD where exceptions are stored on a per-block basis, interleaved with the truncated values.
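To make the three outputs of a block concrete, the following sketch splits one block into its truncated values, its byte-array entries and the high bits of its exceptions. It is an illustration only: `EncodedBlock` and `split_block` are hypothetical names, the truncated values are returned unpacked (one per 32-bit word) rather than bit-packed, and the page-level header, padding and exception compression described above are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct EncodedBlock {
  std::vector<uint32_t> truncated;  // b lowest bits of every value
  std::vector<uint8_t> bytes;       // b, maxbits, exception count, positions
  std::vector<uint32_t> high_bits;  // (maxbits - b) highest bits per exception
};

EncodedBlock split_block(const uint32_t* block, size_t n, uint32_t b,
                         uint32_t maxbits) {
  // Positions must fit in one byte; the paper uses blocks of 128 integers.
  assert(block != nullptr && n <= 128 && b <= maxbits && maxbits <= 32);
  EncodedBlock e;
  const uint32_t mask = (b == 32) ? 0xFFFFFFFFu : ((1u << b) - 1);
  std::vector<uint8_t> positions;
  for (size_t i = 0; i < n; ++i) {
    e.truncated.push_back(block[i] & mask);
    if ((block[i] & ~mask) != 0) {  // value does not fit in b bits
      positions.push_back(static_cast<uint8_t>(i));
      e.high_bits.push_back(block[i] >> b);
    }
  }
  e.bytes.push_back(static_cast<uint8_t>(b));
  e.bytes.push_back(static_cast<uint8_t>(maxbits));
  if (maxbits > b) {  // the counter is stored only when exceptions may exist
    e.bytes.push_back(static_cast<uint8_t>(positions.size()));
    e.bytes.insert(e.bytes.end(), positions.begin(), positions.end());
  }
  return e;
}
```

For the running example with $b = 2$ and a maximal bit width of 6, the byte-array entries are 2, 6, 3, 4, 9, 11 and the exception values are 1001, 1000 and 1101 in binary.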
SimplePFOR and FastPFOR differ in how they compress high bits of the exception values:

In the SimplePFOR scheme, we collect all these values (such as 1001, 1000, 1101) in one 32-bit array and we compress them using Simple8b. We apply Simple8b only once per compressed page.

In the FastPFOR scheme, we store exceptions in one of 32 arrays, one for each possible bit width (from 1 to 32). When encoding a block, the difference between the maximal bit width and $b$ determines in which array the exceptions are stored. Each of the 32 arrays is then bit packed using the corresponding bit width. Arrays are padded so that their length is a multiple of 32 integers.
In our example, the 3 values corresponding to the high bits of exceptions (1001, 1000, 1101) would be stored in the fourth array and bit-packed using 4 bits per value.
In practice, we store the 32 arrays as follows. We start with a 32-bit bitset: each bit of the bitset corresponds to one array. The bit is set to true if the array is not empty and to false otherwise. Then all non-empty bit-packed arrays are stored in sequence. Each bit-packed array is preceded by a 32-bit counter indicating its length.
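The bookkeeping for the 32 exception arrays can be sketched as follows. This is a hypothetical in-memory structure (`ExceptionArrays` is not a name from the paper), and it assumes that bit $w - 1$ of the bitset corresponds to the array of width $w$, an ordering that the text does not specify. Bit packing and padding of the arrays are omitted.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

struct ExceptionArrays {
  // by_width[w - 1] collects the high bits of exceptions needing w bits.
  std::array<std::vector<uint32_t>, 32> by_width;

  // width is the maximal bit width of the block minus b.
  void add(uint32_t high_bits, uint32_t width) {
    assert(width >= 1 && width <= 32);
    by_width[width - 1].push_back(high_bits);
  }

  // One bit per array: set when the array is non-empty (assumed bit order).
  uint32_t bitset() const {
    uint32_t bits = 0;
    for (uint32_t w = 0; w < 32; ++w)
      if (!by_width[w].empty()) bits |= (1u << w);
    return bits;
  }
};
```

With the three exceptions of the running example (width $6 - 2 = 4$), only the fourth array is non-empty, so only one bit of the bitset is set.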
In all other aspects, SimplePFOR and FastPFOR are identical.
These schemes provide effective compression even though they were designed for speed. Indeed, suppose we could compress the highest bits of 3 exceptions of our example (1001, 1000, 1101) using only 4 bits each. For this block alone, we use 32 bits for the truncated data, 48 bits in the byte array plus 12 bits for the values of the exceptions. The total would be 92 bits to store the 16 original integers, or 5.75 bits/int. This compares favorably to the maximal bit width of these integers (6). In our implementation, we use blocks of 128 integers instead of only 16 integers so that good compression is more likely.
During decoding, the exceptions are first decoded in bulk. To ensure that we do not overwhelm the CPU cache, we process the data in pages of $2^{16}$ integers. We then unpack the integers and apply patching on a per-block basis. The exception locations do not need any particular decoding: they are read byte by byte.
Though SimplePFOR and FastPFOR are similar in design to NewPFD and OptPFD, we find that they offer better coding and decoding speed. In our tests (see § 6), FastPFOR and SimplePFOR encode integers about twice as fast as NewPFD. It is an indication that compressing exceptions in bulk is faster.
Data to be compressed: … 10, 10, 1, 10, 100110, 10, 1, 11, 10, 100000, 10, 110100, 10, 11, 11, 1 …
Truncated data ($16 \times 2 = 32$ bits): … 10, 10, 01, 10, 10, 10, 01, 11, 10, 00, 10, 00, 10, 11, 11, 01 …
Byte array ($6 \times 8 = 48$ bits): … 2, 6, 3, 4, 9, 11 …
Exception data (to be compressed): … 1001, 1000, 1101 …
We also designed a new scheme, SIMDFastPFOR: it is identical to FastPFOR except that it relies on vectorized bit packing for the truncated integers and the high bits of the exception values. The compression ratio is slightly diminished for two reasons:

The 32 exception arrays are padded so that their length is a multiple of 128 integers, instead of 32 integers.

We insert some padding prior to storing bit packing data so that alignment on 128-bit boundaries is preserved.
This padding adds an overhead of about 0.3–0.4 bits per integer (see Table V).
6 Experiments
The goal of our experiments is to evaluate the best known integer encoding methods. The first series of tests in § 6.4 is based on synthetic data sets first presented by Anh and Moffat [25]: ClusterData and Uniform. They have the benefit that they can be quickly implemented, thus helping reproducibility. We then confirm our results in § 6.5 using large realistic data sets based on TREC collections ClueWeb09 and GOV2.
6.1 Hardware
We carried out our experiments on a Linux server equipped with Intel Core i7 2600 (3.40 GHz, 8192 KB of L3 CPU cache) and 16 GB of RAM. The DDR3-1333 RAM with dual channel has a transfer rate of 20,000 MB/s or 5300 mis. According to our tests, it can copy arrays at a rate of 2270 mis with the C function memcpy.
6.2 Software
We implemented our algorithms in C++ using GNU GCC 4.7. We use the optimization flag -O3. Because the varintG8IU scheme requires SSSE3 instructions, we had to add the flag -mssse3. When compiling our implementation of Willhalm et al. [47] bit unpacking, we had to use the flag -msse4.1 since it requires SSE4.1 instructions. Our complete source code is available online at https://github.com/lemire/FastPFOR.
Following Stepanov et al. [12], we compute speed based on the wall-clock in-memory processing. Wall-clock times include the time necessary for differential coding and decoding. During our tests, we do not retrieve or store data on disk: it is impossible to decode billions of integers per second when they are kept on disk.
Arrays containing more than $2^{16}$ integers (256 KB) are broken down into smaller chunks. Each chunk is decoded in two passes. In the first pass, we decompress deltas and store each delta value using a 32-bit word. In the second pass, we carry out an in-place computation of prefix sums. As noted in § 3, this approach greatly improves data locality and leads to a significant improvement in decoding speed for the fastest schemes.
Our implementation of VSEncoding, NewPFD, and OptPFD is based on software published by Silvestri and Venturini [24]. They report that their implementation of OptPFD was validated against an implementation provided by the original authors [10]. We implemented varintG8IU from Stepanov et al. [12] as well as Simple8b from Anh and Moffat [25]. To minimize branching, we implemented Simple8b using a C++ switch case that selects one of 16 functions, that is, one for each selector value. Using a function for each selector value instead of a single function is faster because loop unrolling eliminates branching. (Anh and Moffat [25] referred to this optimization as bulk unpacking.) We also implemented the original PFOR scheme from Zukowski et al. [23] as well as its successor PFOR2008 from Zhang et al. [26]. Zukowski et al. made a distinction between PFOR and PFORDelta: we effectively use PFORDelta since we apply PFOR to deltas.
Reading and writing unaligned data can be as fast as reading and writing aligned data on recent Intel processors, as long as we do not cross a 64-byte cache line. Nevertheless, we still wish to align data on 32-bit boundaries when using regular binary packing. Each block of 32 bit-packed integers should be preceded by a descriptor that stores the bit width ($b$) of integers in the block. The number of bits used by the block is always divisible by 32. Hence, to keep blocks aligned on 32-bit boundaries, we group the blocks and respective descriptors into meta-blocks each of which contains 4 successive blocks. A meta-block is preceded by a 32-bit descriptor that combines 4 bit widths (8 bits per width). We call this scheme BP32. We also experimented with versions of binary packing on fewer integers (8 integers and 16 integers). Because these versions are slower, we omit them from our experiments.
We also implemented a vectorized binary packing over blocks of 128 integers (henceforth SIMDBP128). Similar to regular binary packing, we want to keep the blocks aligned on 128-bit boundaries when using vectorized binary packing. To this end, we regroup 16 blocks into a meta-block of 2048 integers. As in BP32, the encoded representation of a meta-block is preceded by a 128-bit descriptor word keeping 16 bit widths (8 bits per width).
In summary, the format of our binary packing schemes is as follows:

SIMDBP128 combines 16 blocks of 128 integers whereas BP32 combines 4 blocks of 32 integers.

SIMDBP128 employs (vertical) vectorized bit packing whereas BP32 relies on the regular bit packing as described in § 4.
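As an illustration of the BP32 descriptor, the sketch below combines four bit widths into one 32-bit word and computes the size of the resulting meta-block. The function names are hypothetical, and placing the width of block $k$ in bits $8k$ to $8k+7$ is an assumption: the text only states that each width uses 8 bits. A block of 32 integers packed with $b$ bits each occupies exactly $b$ 32-bit words, which is why the blocks stay aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Combines the bit widths of 4 successive blocks into one 32-bit descriptor.
// Assumed order: the width of block k is stored in bits 8k..8k+7.
uint32_t make_descriptor(const uint8_t widths[4]) {
  uint32_t d = 0;
  for (int k = 0; k < 4; ++k) {
    assert(widths[k] <= 32);
    d |= static_cast<uint32_t>(widths[k]) << (8 * k);
  }
  return d;
}

// Reads back the bit width of block k (0 <= k < 4) from the descriptor.
uint8_t descriptor_width(uint32_t d, int k) {
  assert(k >= 0 && k < 4);
  return static_cast<uint8_t>((d >> (8 * k)) & 0xFF);
}

// Size of a meta-block in 32-bit words: one descriptor word plus b words
// for each block of 32 integers packed with b bits per integer.
size_t metablock_words(uint32_t d) {
  size_t words = 1;
  for (int k = 0; k < 4; ++k) words += descriptor_width(d, k);
  return words;
}
```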
Many schemes such as BP32 and SIMDBP128 require the computation of the integer logarithm during encoding. If done naively, this step can take up most of the running time: the computation of the integer logarithm is slower than a fast operation such as a shift or an addition. We found it best to use the bit scan reverse (bsr) assembly instruction on x86 platforms (as it provides $\lceil \log_2 (x + 1) \rceil - 1$ whenever $x > 0$).
For the binary packing schemes, we must determine the maximum of the integer logarithm of the integers ($\max_i \lceil \log_2 (x_i + 1) \rceil$) during encoding. Instead of computing one integer logarithm per integer, we carry out a bitwise logical or on all the integers and compute the integer logarithm of the result. This shortcut is possible due to the equation: $\max_i \lceil \log_2 (x_i + 1) \rceil = \lceil \log_2 ((\bigvee_i x_i) + 1) \rceil$ where $\vee$ refers to the bitwise logical or.
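The shortcut can be sketched as follows. The sketch uses the GCC/Clang builtin `__builtin_clz` (count leading zeros) in place of a hand-written bsr instruction, which is an assumption about the toolchain. The function names are hypothetical, and the naive version is included only for comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bit width of x, i.e., ceil(log2(x + 1)). __builtin_clz is undefined for
// 0, so that case is handled separately.
inline uint32_t bit_width_of(uint32_t x) {
  return x == 0 ? 0u : 32u - static_cast<uint32_t>(__builtin_clz(x));
}

// One logarithm per block: OR all values together, then take the bit width
// of the result. The OR has its highest set bit exactly where the largest
// value has its highest set bit.
uint32_t max_bit_width(const uint32_t* data, size_t n) {
  assert(data != nullptr || n == 0);
  uint32_t acc = 0;
  for (size_t i = 0; i < n; ++i) acc |= data[i];
  return bit_width_of(acc);
}

// Naive reference: one logarithm per integer.
uint32_t max_bit_width_naive(const uint32_t* data, size_t n) {
  assert(data != nullptr || n == 0);
  uint32_t best = 0;
  for (size_t i = 0; i < n; ++i) {
    const uint32_t w = bit_width_of(data[i]);
    if (w > best) best = w;
  }
  return best;
}
```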
Some schemes compress data in blocks of fixed length (e.g., 128 integers). We compress the remainder using Variable Byte as in Zhang et al. [26]. In our tests, most arrays are large compared to the block size. Thus, replacing Variable Byte by another scheme would make little or no difference.
Speeds are reported in millions of 32-bit integers per second (mis). Stepanov et al. report a speed of 1059 mis over the TREC GOV2 data set for their best scheme varintG8IU. We got a similar speed (1300 mis).
VSEncoding, FastPFOR, and SimplePFOR use buffers during compression and decompression proportional to the size of the array. VSEncoding uses a persistent buffer of over 256 KB. We implemented SIMDFastPFOR, FastPFOR, and SimplePFOR with a persistent buffer of slightly more than 64 KB. PFOR, PFOR2008, NewPFD, and OptPFD are implemented using persistent buffers proportional to the block size (128 integers in our tests): less than 512 KB in persistent buffer memory are used for each scheme. Both PFOR and PFOR2008 use pages of $2^{16}$ integers or 256 KB. During compression, PFOR, PFOR2008, SIMDFastPFOR, FastPFOR, and SimplePFOR use a buffer to store exceptions. These buffers are limited by the size of the pages and they are released immediately after decoding or encoding an array.
The implementation of VSEncoding [24] uses some SSE2 instructions through assembly during bit unpacking. VarintG8IU makes explicit use of SSSE3 instructions through SIMD intrinsic functions whereas SIMDFastPFOR and SIMDBP128 similarly use SSE2 intrinsic functions.
Though we tested vectorized differential coding with all schemes, we only report results for schemes that make explicit use of SIMD instructions (SIMDFastPFOR, SIMDBP128, and varintG8IU). To ensure fast vector processing, we align all initial pointers on 16-byte boundaries.
6.3 Computing bit packing
We implemented bit packing using hand-tuned functions as originally proposed by Zukowski et al. [23]. Given a bit width $b$, a sequence of $K$ unsigned 32-bit integers is coded to $\lceil Kb/32 \rceil$ integers. In our tests, we used $K = 32$ for the regular version, and $K = 128$ for the vectorized version.
Fig. 15 illustrates the speed at which we can pack and unpack integers using blocks of 32 integers. In some schemes, it is known that all integers are no larger than $2^b - 1$, while in patching schemes there are exceptions, i.e., integers larger than or equal to $2^b$. In the latter case, we enforce that integers are smaller than $2^b$ through the application of a mask. This operation slows down compression.
We can pack and unpack much faster when the number of bits is small because less data needs to be retrieved from RAM. Also, we can pack and unpack faster when the bit width is 4, 8, 16, 24 or 32. Packing and unpacking with bit widths of 8 and 16 is especially fast.
The vectorized version (Fig. 15(b)) is roughly twice as fast as the scalar version. We can unpack integers having a bit width of 8 or less at a rate of 6000 mis. However, it carries the implicit constraint that integers must be packed and unpacked in blocks of at least 128 integers. Packing is slightly faster when the bit width is 8 or 16.
In Fig. 15(b) only, we report the unpacking speed when using the horizontal data layout as described by Willhalm et al. [47] (see § 4). When the bit widths range from 16 to 26, the vertical and horizontal techniques have the same speed. For small or large bit widths, our approach based on a vertical layout is preferable as it is up to 70% faster. Accordingly, all integer coding schemes are implemented using the vertical layout.
We also experimented with the cases where we pack fewer integers ($K = 8$ or $K = 16$). However, it is slower and a few bits remain unused ($\lceil Kb/32 \rceil \times 32 - Kb$).
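The storage arithmetic behind these block sizes can be checked with a short sketch. The function names are hypothetical; `count` is the number of integers per block and `b` the bit width. For example, 8 integers of 5 bits need two 32-bit words and leave 24 bits unused, whereas blocks of 32 or 128 integers never leave unused bits.

```cpp
#include <cassert>
#include <cstdint>

// Number of 32-bit words needed to pack `count` integers of `b` bits each,
// i.e., the ceiling of count * b / 32.
uint32_t packed_words(uint32_t count, uint32_t b) {
  assert(b <= 32 && count <= (1u << 20));  // keeps count * b within 32 bits
  return (count * b + 31) / 32;
}

// Bits left unused in the last word of the packed block.
uint32_t unused_bits(uint32_t count, uint32_t b) {
  return 32 * packed_words(count, b) - count * b;
}
```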
6.4 Synthetic data sets
We used the ClusterData and the Uniform model from Anh and Moffat [25]. These models generate sets of distinct integers that we keep in sorted order. In the Uniform model, integers follow a uniform distribution whereas in the ClusterData model, integer values tend to cluster. That is, we are more likely to get long sequences of similar values. The goal of the ClusterData model is to simulate more realistically data encountered in practice. We expect data obtained from the ClusterData model to be more compressible.
We generated data sets of random integers in the range $[0, 2^{29})$ with both the ClusterData and the Uniform model. In the first pass, we generated $2^{10}$ short arrays containing $2^{15}$ integers each. The average difference between successive integers within an array is thus $2^{29}/2^{15} = 2^{14}$. We expect the compressed data to use at least 14 bits/int. In the second pass, we generated a single long array of $2^{25}$ integers. In this case, the average distance between successive integers is $2^{4}$: we expect the compressed data to use at least 4 bits/int.
The results are given in Table IV (schemes with a ⋆ by their name, e.g., SIMDFastPFOR⋆, use vectorized differential coding). Over short arrays, we see little compression as expected. There is also relatively little difference in compression efficiency between Variable Byte and a more space-efficient alternative such as FastPFOR. However, speed differences are large: the decoding speed ranges from 220 mis for Variable Byte to 2500 mis for SIMDBP128⋆.
For long arrays, there is a greater difference between the compression efficiencies. The schemes with the best compression ratios are SIMDFastPFOR, FastPFOR, SimplePFOR, Simple8b, OptPFD. Among those, SIMDFastPFOR is the clear winner in terms of decoding speed. The good compression ratio of OptPFD comes at a price: it has one of the worst encoding speeds. In fact, it is 20–50 times slower than SIMDFastPFOR during encoding.
Though they differ significantly in implementation, FastPFOR, SimplePFOR, and SIMDFastPFOR have equally good compression ratios. FastPFOR and SimplePFOR have similar decoding speeds, but SIMDFastPFOR decodes integers much faster than either of them.
In general, encoding speeds vary significantly, but binary packing schemes are the fastest, especially when they are vectorized. Better implementations could possibly help reduce this gap.
The version of SIMDBP128 using vectorized differential coding (written SIMDBP128⋆) is always 400 mis faster during decoding than any other alternative. Though it does not always offer the best compression ratio, it always matches the compression ratio of Variable Byte.
The difference between using vectorized differential coding and regular differential coding could amount to up to 2 bits per integer. Yet, typically, the difference is less than 2 bits. For example, SIMDBP128⋆ only uses about one extra bit per integer when compared with SIMDBP128. The cost of binary packing is determined by the largest delta in a block: increasing the average size of the deltas by a factor of 4 does not necessarily lead to a fourfold increase in the expected largest integer (in a block of 128 deltas).
Compared to our novel schemes, performance of varintG8IU is unimpressive. However, varintG8IU is about 60% faster than Variable Byte while providing a similar compression efficiency. It is also faster than Simple8b, though Simple8b has a better compression efficiency. The version with vectorized differential coding (written varintG8IU⋆) has poor compression over the short arrays compared with the regular version (varintG8IU). Otherwise, on long arrays, varintG8IU⋆ is significantly faster (from 1300 mis to 1600 mis) than varintG8IU while compressing just as well.
There is little difference between PFOR and PFOR2008 except that PFOR offers a significantly faster encoding speed. Among all the schemes taken from the literature, PFOR and PFOR2008 have the best decoding speed in these tests: they use a single bit width for all blocks, determined once at the beginning of the compression. However, they are dominated in all metrics (coding speed, decoding speed and compression ratio) by SIMDBP128 and SIMDFastPFOR.
For comparison, we tested Google Snappy (version 1.0.5) as a delta compression technique. Google Snappy is a freely available library used internally by Google in its database engines [17]. We believe that it is competitive with other fast generic compression libraries such as zlib or LZO. For short ClusterData arrays, we got a decoding speed of 340 mis and almost no compression (29 bits/int.). For long ClusterData arrays, we got a decoding speed of 200 mis and 14 bits/int. Overall, Google Snappy has about half the compression efficiency of SIMDBP128 while being an order of magnitude slower.




6.5 Realistic data sets
The posting list of a word is an array of document identifiers where the word occurs. For more realistic data sets, we used posting lists obtained from two TREC Web collections. Our data sets include only document identifiers, but not positions of words in documents. For our purposes, we do not store the words or the documents themselves, just the posting lists.
The first data set is a posting list collection extracted from the ClueWeb09 (Category B) data set [51]. The second data set is a posting list collection built from the GOV2 data set by Silvestri and Venturini [24]. GOV2 is a crawl of the .gov sites, which contains 25 million HTML, text, and PDF documents (the latter are converted to text).
The ClueWeb09 collection is a more realistic HTML collection of about 50 million crawled HTML documents, mostly in English. It represents postings for the one million most frequent words. Common stop words were excluded and different grammar forms of words were conflated. Documents were enumerated in the order they appear in source files, i.e., they were not reordered. Unlike GOV2, the ClueWeb09 crawl is not limited to any specific domain. Uncompressed, the posting lists from GOV2 and ClueWeb09 use 20 GB and 50 GB respectively.
We decomposed these data sets according to the array length, storing all arrays of lengths $2^K$ to $2^{K+1}-1$ consecutively. We applied differential coding on the arrays (integers $x_1, x_2, x_3, \ldots$ are transformed to $x_1, x_2 - x_1, x_3 - x_2, \ldots$) and computed the Shannon entropy ($-\sum_v p(v) \log_2 p(v)$) of the result. We estimate the probability $p(v)$ of the integer value $v$ as the number of occurrences of $v$ divided by the number of integers. As Fig. 18 shows, longer arrays are more compressible. There are differences in entropy values between the two collections (ClueWeb09 has about two extra bits, see Fig. 18a), but these differences are much smaller than those among different array sizes. Fig. 18b shows the distribution of array lengths and entropy values.
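The entropy estimate described above can be sketched in a few lines of Python. This is an illustration only: the function name `delta_entropy` is hypothetical, the first integer is kept as its own delta, and probabilities are the empirical frequencies of the delta values.

```python
from collections import Counter
from math import log2


def delta_entropy(xs):
    """Shannon entropy (bits/int) of the deltas of a sorted array, using empirical frequencies."""
    deltas = [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]
    n = len(deltas)
    counts = Counter(deltas)
    return -sum((c / n) * log2(c / n) for c in counts.values())


# Deltas are 1, 2, 1, 2: two equally likely values.
print(delta_entropy([1, 3, 4, 6]))
```

An array whose deltas take two equally likely values has an entropy of one bit per integer, whereas an array with constant deltas has an entropy of zero.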
6.5.1 Results over different array lengths
We present results per array length for selected schemes in Fig. 27. Longer arrays are more compressible since the deltas, i.e., differences between adjacent elements, are smaller.
We see in Figs. 27b and 27f that all schemes compress the deltas within a factor of two of the Shannon entropy for short arrays. For long arrays however, the compression (compared to the Shannon entropy) becomes worse for all schemes. Yet many of them manage to remain within a factor of three of the Shannon entropy.
Integer compression schemes are better able to compress close to the Shannon entropy over ClueWeb09 (see Fig. 27b) than over GOV2 (see Fig. 27f). For example, SIMDFastPFOR, Simple8b, and OptPFD are within a factor of two of Shannon entropy over ClueWeb09 for all array lengths, whereas they all exceed three times the Shannon entropy over GOV2 for the longest arrays. Similarly, varintG8IU, SIMDBP128, and SIMDFastPFOR remain within a factor of six of the Shannon entropy over ClueWeb09 but they all exceed this factor over GOV2 for long arrays. In general, it might be easier to compress data close to the entropy when the entropy is larger.
We get poor results with varintG8IU over the longer (and more compressible) arrays (see Figs. 27a and 27e). We do not find this surprising because varintG8IU requires at least 9 bits/int. In effect, when other schemes such as SIMDFastPFOR and SIMDBP128 use less than about 8 bits/int, they surpass varintG8IU in both compression efficiency and decoding speed. When the storage exceeds 9 bits/int, on the other hand, varintG8IU is one of the fastest methods available for these data sets. Nevertheless, we also got poor results with varintG8IU on the ClusterData and Uniform data sets for short (and poorly compressible) arrays in § 6.4.
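The 9 bits/int floor can be illustrated with a small Python sketch. It assumes the usual description of varintG8IU: each group occupies 9 bytes (one descriptor byte followed by 8 data bytes), integers use 1 to 4 whole bytes, and an integer never straddles two groups. The function name is hypothetical and the sketch only counts storage; it does not produce the encoded bytes.

```python
def varint_g8iu_bits_per_int(values):
    """Storage cost under the assumed layout: 9-byte groups holding at most 8 data bytes."""
    groups = 0
    used = 8  # forces a new group for the first value
    for v in values:
        nbytes = max(1, (v.bit_length() + 7) // 8)  # whole bytes needed for this integer
        if used + nbytes > 8:   # does not fit: start a new 9-byte group
            groups += 1
            used = 0
        used += nbytes
    return groups * 72 / len(values)


small = varint_g8iu_bits_per_int([5] * 16)     # sixteen 1-byte values, 8 per group
medium = varint_g8iu_bits_per_int([300] * 8)   # eight 2-byte values, 4 per group
print(small, medium)
```

Even when every delta fits in a single byte, a group holds at most 8 integers in 72 bits, hence the 9 bits/int minimum under these assumptions.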
We see in Figs. 27c and 27g that both SIMDBP128 and SIMDBP128⋆ have a significantly better encoding speed, irrespective of the array length. The opposite is true for OptPFD: it is much slower than the alternatives.
Examining the decoding speed as a function of array length (see Figs. 27c and 27g), we see that several schemes have a significantly worse decoding speed over short (and poorly compressible) arrays, but the effect is most pronounced for the new schemes we introduced (SIMDFastPFOR, SIMDFastPFOR⋆, SIMDBP128, and SIMDBP128⋆). Meanwhile, varintG8IU and Simple8b have a decoding speed that is less sensitive to the array length.
6.5.2 Aggregated results
Not all posting lists are equally likely to be retrieved by the search engine. As observed by Stepanov et al. [12], it is desirable to account for different term distributions in queries. Unfortunately, we do not know of an ideal approach to this problem. Nevertheless, to model more closely the performance of a major search engine, we used the AOL query log data set as a collection of query statistics [52, 53]. It consists of about 20 million web queries collected from 650 thousand users over three months: queries repeating within a single user session were ignored. When possible (in about 90% of all cases), we matched the query terms with posting lists in the ClueWeb09 data set and obtained term frequencies (see Fig. 18b). This allowed us to estimate how often a posting list of length between $2^K$ and $2^{K+1}-1$ is likely to be retrieved for various values of $K$. This gave us a weight vector that we use to aggregate our results.
We present aggregated results in Table V. The results are generally similar to what we obtained with synthetic data. The newly introduced schemes (SIMDBP128⋆, SIMDFastPFOR⋆, SIMDBP128, SIMDFastPFOR) still offer the best decoding speed. We find that varintG8IU⋆ is much faster than varintG8IU (1500 mis vs. 1300 mis over GOV2) even though the compression ratio is the same within a margin of 10%. PFOR and PFOR2008 offer a better compression than varintG8IU but at a reduced speed. However, we find that SIMDBP128 is preferable in every way to varintG8IU⋆, varintG8IU, PFOR, and PFOR2008.


For some applications, decoding speed and compression ratios are the most important metrics. Whereas elsewhere we report the number of bits per integer $b$, we can easily compute the compression ratio as $32/b$. We plot both metrics for some competitive schemes (see Fig. 30). These plots suggest that the most competitive schemes are SIMDBP128⋆, SIMDFastPFOR⋆, SIMDBP128, SIMDFastPFOR, SimplePFOR, FastPFOR, Simple8b, and OptPFD depending on how much compression is desired. Fig. 30 also shows that to achieve decoding speeds higher than 1300 mis, we must choose between SIMDBP128, SIMDFastPFOR⋆, and SIMDBP128⋆.
Few research papers report encoding speed. Yet we find large differences: for example, VSEncoding and OptPFD are two orders of magnitude slower during encoding than our fastest schemes. If the compressed arrays are written to slow disks in a batch mode, such differences might be of little practical significance. However, for memory-based databases and network applications, slow encoding speeds could be a concern. For example, the output of a query might need to be compressed or we might need to index the data in real time [54]. Our SIMDBP128 and SIMDBP128⋆ schemes have especially fast encoding.
Similarly to previous work [12, 24], in Table VI we report unweighted averages. The unweighted speed aggregates are equivalent to computing the average speed over all arrays, irrespective of their lengths. From the distribution of posting size logarithms in Fig. 18b, one may conclude that weighted results should be similar to unweighted ones. This is supported by the data in Table VI: the unweighted decoding speeds and compression ratios differ by less than 15% from the weighted results presented in Table V.
We can compare the number of bits per integer in Table VI with an information-theoretic limit. Indeed, the Shannon entropy for the deltas of ClueWeb09 is 5.5 bits/int whereas it is 3.6 for GOV2. Hence, OptPFD is within 16% of the entropy on ClueWeb09 whereas it is within 22% of the entropy on GOV2. Meanwhile, the faster SIMDFastPFOR is within 30% and 40% of the entropy for ClueWeb09 and GOV2. Our fastest scheme (SIMDBP128) compresses the deltas of GOV2 to twice the entropy. It does slightly better with ClueWeb09.


7 Discussion
We find that binary packing is both fast and space efficient. The vectorized binary packing (SIMDBP128) is our fastest scheme. While it has a lesser compression efficiency compared to Simple8b, it is more than 3 times faster during decoding. Moreover, in the worst case, a slower binary packing scheme (BP32) incurred a cost of only about 1.2 bits per integer compared to the patching scheme with the best compression ratio (OptPFD) while decoding nearly as fast (within 10%) as the fastest patching scheme (PFOR).
Yet only a few authors have considered binary packing schemes or their vectorized variants in the recent literature:

Delbru et al. [41] reported good results with a binary packing scheme similar to our BP32: in their experiments, it surpassed Simple8b as well as a patched scheme (PFOR2008).

Anh and Moffat [9] also reported good results with a binary packing scheme: in their tests, it decoded at least 50% faster than either Simple8b or PFOR2008. On the other hand, they reported that their binary packing scheme had poorer compression.

Schlegel et al. [34] proposed a scheme similar to SIMDBP128. This scheme (called $k$-gamma) uses a vertical data layout to store integers, like our SIMDBP128 and SIMDFastPFOR schemes. It essentially applies binary packing to tiny groups of integers (at most 4 elements). From our preliminary experiments, we learned that decoding integers in small groups is not efficient. This is also supported by results of Schlegel et al. [34]. Their fastest decoding speed, which does not include writing back to RAM, is only 1600 mis (Core i7-920, 2.67 GHz).

Willhalm et al. [47] used a vectorized binary packing like our SIMDBP128, but with a horizontal data layout instead of our vertical layout. The decoding algorithm relies on the shuffle instruction pshufb. Our experimental results suggest that our approach based on a vertical layout might be preferable (see Fig. (a)a): our implementation of bit unpacking over a vertical layout is sometimes between 50% and 70% faster than our reimplementation over a horizontal layout based on the work of Willhalm et al. [47].
This performance comparison depends on the quality of our software. Yet the speed of our reimplementation is comparable with the speed originally reported by Willhalm et al. [47, Fig. 11]: they report a speed of 3300 mis with a bit width of 6. In contrast, using our implementation of their algorithms, we got a speed above 4800 mis for the same bit width and a 20% higher clock speed on a more recent CPU architecture.
The approach described by Willhalm et al. might be more competitive on platforms with instructions for simultaneously shifting several values by different offsets (e.g., the vpsrlvd AVX2 instruction). Without such instructions, these per-value shifts must be emulated by multiplications by powers of two followed by shifting.
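The emulation mentioned above can be sketched in Python, with plain integers standing in for vector lanes. This is an illustration of the arithmetic identity only, not the code of Willhalm et al.: the function name and the `max_shift` parameter are hypothetical, and a fixed-width implementation would additionally need lanes wide enough to hold the intermediate products.

```python
def variable_right_shift(values, shifts, max_shift):
    """Shift each lane right by its own offset using one multiply and one uniform shift."""
    out = []
    for v, s in zip(values, shifts):
        # Per-lane multiplication by a power of two: a left shift by (max_shift - s).
        scaled = v * (1 << (max_shift - s))
        # The same shift count for every lane, as a plain SIMD shift would apply.
        out.append(scaled >> max_shift)
    return out


lanes = [53, 1000, 77, 255]
offsets = [0, 3, 1, 7]
shifted = variable_right_shift(lanes, offsets, max_shift=7)
print(shifted)
```

Multiplying by $2^{k-s}$ and then shifting right by $k$ has the same effect as shifting right by $s$, which is why a uniform shift suffices once each lane has been scaled.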
Vectorized bit-packing schemes are efficient: they encode/decode integers at speeds of 4000–8500 mis. Hence, the computation of deltas and prefix sums may become a major bottleneck. This bottleneck can be removed through vectorization of these operations (though at the expense of poorer compression ratios in our case). We have not encountered this approach in the literature, perhaps because for slower schemes the computation of the prefix sum accounts for a small fraction of the total running time. In our implementation, to ease comparisons, we have separated differential decoding from data decompression: an integrated approach could be up to twice as fast in some cases. Moreover, we might be able to improve the decoding speed and the compression ratios with better vectorized algorithms. There might also be alternatives to data differencing, which also permit vectorization, such as linear regression [43].
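The difference between the two prefix-sum computations can be sketched as follows. This Python illustration is not the authors' code: it assumes a 4-lane layout in which each delta was computed against the value four positions earlier, and a list of four running totals stands in for one SIMD register.

```python
def prefix_sum(deltas):
    """Scalar decoding: each output depends on the previous one (a serial chain)."""
    out, running = [], 0
    for d in deltas:
        running += d
        out.append(running)
    return out


def prefix_sum_lanes(deltas, lanes=4):
    """Lane-wise decoding: out[i] = out[i - lanes] + deltas[i], one vector add per chunk."""
    out = []
    acc = [0] * lanes  # stands in for one SIMD register of running totals
    for start in range(0, len(deltas), lanes):
        chunk = deltas[start:start + lanes]
        acc = [a + d for a, d in zip(acc, chunk)]
        out.extend(acc)
    return out


original = [3, 4, 9, 10, 12, 20, 21, 30]
regular = [3, 1, 5, 1, 2, 8, 1, 9]           # x[i] - x[i-1]
lanewise = [3, 4, 9, 10, 9, 16, 12, 20]      # x[i] - x[i-4]
print(prefix_sum(regular), prefix_sum_lanes(lanewise))
```

Both calls recover the original array. The lane-wise version needs a single vector addition per four integers, but its deltas are larger, which is the source of the poorer compression mentioned above.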
In our results, the original patched coding scheme (PFOR) is bested on all three metrics (compression ratio, coding and decoding speed) by a binary packing scheme (SIMDBP128). Similarly, a more recent fast patching scheme (NewPFD) is generally bested by another binary packing scheme (BP32). Indeed, though the compression ratio of NewPFD is up to 6% better on realistic data, NewPFD is at least 20% slower than BP32. Had we stopped our investigations there, we might have been tempted to conclude that patched coding is not a viable solution when decoding speed is the most important characteristic on desktop processors. However, we designed a new vectorized patching scheme SIMDFastPFOR. It shows that patching remains a fruitful strategy even when SIMD instructions are used. Indeed, it is faster than the SIMD-based varintG8IU while providing a much better compression ratio (by at least 35%). In fact, on realistic data, SIMDFastPFOR is better than BP32 on two key metrics: decoding speed and compression ratio (see Fig. 30).
In the future, we may expect increases in the arity of SIMD operations supported by commodity CPUs (e.g., with AVX) as well as in memory speeds (e.g., with DDR4 SDRAM). These future improvements could make our vectorized schemes even faster in comparison to their scalar counterparts. However, an increase in arity means an increase in the minimum block size. Yet, when we increase the size of the blocks in binary packing, we also make them less space efficient in the presence of outlier values. Consider that BP32 is significantly more space efficient than SIMDBP128 (e.g., 5.5 bits/int vs. 6.3 bits/int on GOV2).
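The sensitivity of larger blocks to outliers can be illustrated with a short Python sketch. The data is synthetic, the function name is hypothetical, the array length is assumed to be a multiple of the block size, and the per-block header storing the bit width is ignored.

```python
def packed_bits_per_int(deltas, block):
    """Average packed width: each block is stored with the bit length of its largest delta."""
    widths = [max(deltas[s:s + block]).bit_length()
              for s in range(0, len(deltas), block)]
    return sum(widths) / len(widths)


# 128 small deltas with a single outlier that needs 10 bits.
deltas = [1] * 128
deltas[5] = 1000

cost_128 = packed_bits_per_int(deltas, block=128)  # the outlier taxes all 128 integers
cost_32 = packed_bits_per_int(deltas, block=32)    # the outlier taxes only its own 32
print(cost_128, cost_32)
```

With blocks of 128 integers, the single outlier forces 10 bits for every integer, whereas with blocks of 32 integers only one of the four blocks pays for it.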
Thankfully, the problem of outliers in large blocks can be solved through patching. Indeed, even though OptPFD uses the same block size as SIMDBP128, it offers significantly better compression (4.5 bits/int vs. 6.3 bits/int on GOV2). Thus, patching may be more useful for future computers—capable of processing larger vectors—than for current ones.
While our work focused on decoding speed, there is promise in directly processing data while still in compressed form, ideally by using vectorization [47]. We expect that conceptually simpler schemes (e.g., SIMDBP128) might have the advantage over relatively more sophisticated alternatives (e.g., SIMDFastPFOR) for this purpose.
Many of the fastest schemes use relatively large blocks (128 integers) that are decoded all at once. Yet not all queries require decoding the entire array. For example, consider the computation of intersections between sorted arrays. It is sometimes more efficient to use random access, especially when processing arrays with vastly different lengths [55, 56, 57]. If the data is stored in relatively large compressed blocks (e.g., 128 integers with SIMDBP128), the granularity of random access might be reduced (e.g., when implementing a skip list). Hence, we may end up having to scan many more integers than needed. However, blocks of 128 integers might not necessarily be an impediment to good performance. Indeed, Schlegel et al. [58] were able to accelerate the computation of intersections by a factor of 5 with vectorization using blocks of up to 65 536 integers.
8 Conclusion
We have presented new schemes that are up to twice as fast as the previously best available schemes in the literature while offering competitive compression ratios and encoding speed. This was achieved by vectorization of almost every step including differential decoding. To achieve both high speed and competitive compression ratios, we introduced a new patched scheme that stores exceptions in a way that permits vectorization (SIMDFastPFOR).
In the future, we might seek to generalize our results over more varied architectures as well as to provide a greater range of tradeoffs between speed and compression ratio. Indeed, most commodity processors support vector processing (e.g., Intel, AMD, PowerPC, ARM). We might also want to consider adaptive schemes that compress more aggressively when the data is more compressible and optimize for speed otherwise. For example, one could use a scheme such as varintG8IU for less compressible arrays and SIMDBP128 for the more compressible ones. One could also use workload-aware compression: frequently accessed arrays could be optimized for decoding speed whereas the least frequently accessed data could be optimized for high compression efficiency. Finally, we should consider more than just 32-bit integers. For example, some popular search engines (e.g., Sphinx [59]) support 64-bit document identifiers. We might consider an approach similar to Schlegel et al. [58] who decompose arrays of 32-bit integers into blocks of 16-bit integers.
ACKNOWLEDGEMENT
Our varintG8IU implementation is based on code by M. Caron. V. Volkov provided better loop unrolling for differential coding. P. Bannister provided a fast algorithm to compute the maximum of the integer logarithm of an array of integers. We are grateful to N. Kurz for his insights and for his review of the manuscript. We wish to thank the anonymous reviewers for their valuable comments.
References
 [1] Sebot J, Drach-Temam N. Memory bandwidth: The true bottleneck of SIMD multimedia performance on a superscalar processor. Euro-Par 2001 Parallel Processing, Lecture Notes in Computer Science, vol. 2150. Springer Berlin / Heidelberg, 2001; 439–447, doi:10.1007/3-540-44681-8_63.
 [2] Drepper U. What every programmer should know about memory. http://www.akkadia.org/drepper/cpumemory.pdf [Last checked August 2012.] 2007.
 [3] Manegold S, Kersten ML, Boncz P. Database architecture evolution: mammals flourished long before dinosaurs became extinct. Proceedings of the VLDB Endowment Aug 2009; 2(2):1648–1653.
 [4] Zukowski M. Balancing vectorized query execution with bandwidth-optimized storage. PhD Thesis, Universiteit van Amsterdam 2009.
 [5] Harizopoulos S, Liang V, Abadi DJ, Madden S. Performance tradeoffs in read-optimized databases. Proceedings of the 32nd international conference on Very Large Data Bases, VLDB ’06, VLDB Endowment, 2006; 487–498.
 [6] Westmann T, Kossmann D, Helmer S, Moerkotte G. The implementation and performance of compressed databases. SIGMOD Record September 2000; 29(3):55–67, doi:10.1145/362084.362137.
 [7] Abadi D, Madden S, Ferreira M. Integrating compression and execution in column-oriented database systems. Proceedings of the 2006 ACM SIGMOD International Conference on Management of data, SIGMOD ’06, ACM: New York, NY, USA, 2006; 671–682, doi:10.1145/1142473.1142548.
 [8] Büttcher S, Clarke CLA. Index compression is good, especially for random access. Proceedings of the 16th ACM conference on Information and Knowledge Management, CIKM ’07, ACM: New York, NY, USA, 2007; 761–770, doi:10.1145/1321440.1321546.
 [9] Anh VN, Moffat A. Inverted index compression using word-aligned binary codes. Information Retrieval 2005; 8(1):151–166, doi:10.1023/B:INRT.0000048490.99518.5c.
 [10] Yan H, Ding S, Suel T. Inverted index compression and query processing with optimized document ordering. Proceedings of the 18th International Conference on World Wide Web, WWW ’09, ACM: New York, NY, USA, 2009; 401–410, doi:10.1145/1526709.1526764.
 [11] Popov P. Basic optimizations: Talk at the YaC (Yet Another Conference) held by Yandex (in Russian). http://yac2011.yandex.com/archive2010/topics/ [Last checked Sept 2012.] 2010.
 [12] Stepanov AA, Gangolli AR, Rose DE, Ernst RJ, Oberoi PS. SIMD-based decoding of posting lists. Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11, ACM: New York, NY, USA, 2011; 317–326, doi:10.1145/2063576.2063627.
 [13] Dean J. Challenges in building large-scale information retrieval systems: invited talk. Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM ’09, ACM: New York, NY, USA, 2009; 1–1, doi:10.1145/1498759.1498761. Author’s slides: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/WSDM09keynote.pdf [Last checked Dec. 2012.].
 [14] Lemke C, Sattler KU, Faerber F, Zeier A. Speeding up queries in column stores: a case for compression. Proceedings of the 12th International Conference on Data Warehousing and Knowledge Discovery, DaWaK’10, Springer-Verlag: Berlin, Heidelberg, 2010; 117–129, doi:10.1007/978-3-642-15105-7_10.
 [15] Binnig C, Hildenbrand S, Färber F. Dictionary-based order-preserving string compression for main memory column stores. Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, ACM: New York, NY, USA, 2009; 283–296, doi:10.1145/1559845.1559877.
 [16] Poess M, Potapov D. Data compression in Oracle. VLDB’03, Proceedings of the 29th International Conference on Very Large Data Bases, Morgan Kaufmann: San Francisco, CA, USA, 2003; 937–947.
 [17] Hall A, Bachmann O, Büssow R, Gänceanu S, Nunkesser M. Processing a trillion cells per mouse click. Proceedings of the VLDB Endowment 2012; 5(11):1436–1446.
 [18] Raman V, Swart G. How to wring a table dry: entropy compression of relations and querying of compressed relations. Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB ’06, VLDB Endowment, 2006; 858–869.
 [19] Lemire D, Kaser O. Reordering columns for smaller indexes. Information Sciences June 2011; 181(12):2550–2570, doi:10.1016/j.ins.2011.02.002.
 [20] Bjørklund TA, Grimsmo N, Gehrke J, Torbjørnsen O. Inverted indexes vs. bitmap indexes in decision support systems. Proceedings of the 18th ACM conference on Information and Knowledge Management, CIKM ’09, ACM: New York, NY, USA, 2009; 1509–1512, doi:10.1145/1645953.1646158.
 [21] Holloway AL, Raman V, Swart G, DeWitt DJ. How to barter bits for chronons: Compression and bandwidth trade offs for database scans. Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, ACM: New York, NY, USA, 2007; 389–400.
 [22] Holloway AL, DeWitt DJ. Read-optimized databases, in depth. Proceedings of the VLDB Endowment 2008; 1(1):502–513.
 [23] Zukowski M, Heman S, Nes N, Boncz P. Super-scalar RAM-CPU cache compression. Proceedings of the 22nd International Conference on Data Engineering, ICDE ’06, IEEE Computer Society: Washington, DC, USA, 2006; 59–71, doi:10.1109/ICDE.2006.150.
 [24] Silvestri F, Venturini R. VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming. Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM ’10, ACM: New York, NY, USA, 2010; 1219–1228, doi:10.1145/1871437.1871592.
 [25] Anh VN, Moffat A. Index compression using 64-bit words. Software: Practice and Experience 2010; 40(2):131–147, doi:10.1002/spe.v40:2.
 [26] Zhang J, Long X, Suel T. Performance of compressed inverted list caching in search engines. Proceedings of the 17th International Conference on World Wide Web, WWW ’08, ACM: New York, NY, USA, 2008; 387–396, doi:10.1145/1367497.1367550.
 [27] Witten IH, Moffat A, Bell TC. Managing gigabytes (2nd ed.): compressing and indexing documents and images. Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1999.
 [28] Rice R, Plaunt J. Adaptive variable-length coding for efficient compression of spacecraft television data. IEEE Transactions on Communication Technology 1971; 19(6):889–897.
 [29] Elias P. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory 1975; 21(2):194–203.
 [30] Büttcher S, Clarke C, Cormack G. Information retrieval: Implementing and evaluating search engines. The MIT Press, 2010.
 [31] Transier F, Sanders P. Engineering basic algorithms of an in-memory text search engine. ACM Transactions on Information Systems Dec 2010; 29(1):2:1–2:37, doi:10.1145/1877766.1877768.
 [32] Moffat A, Stuiver L. Binary interpolative coding for effective index compression. Information Retrieval 2000; 3(1):25–47, doi:10.1023/A:1013002601898.
 [33] Walder J, Krátký M, Bača R, Platoš J, Snášel V. Fast decoding algorithms for variable-lengths codes. Information Sciences Jan 2012; 183(1):66–91, doi:10.1016/j.ins.2011.06.019.
 [34] Schlegel B, Gemulla R, Lehner W. Fast integer compression using SIMD instructions. Proceedings of the Sixth International Workshop on Data Management on New Hardware, DaMoN ’10, ACM: New York, NY, USA, 2010; 34–40, doi:10.1145/1869389.1869394.
 [35] Cutting D, Pedersen J. Optimization for dynamic inverted index maintenance. Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’90, ACM: New York, NY, USA, 1990; 405–411, doi:10.1145/96749.98245.
 [36] Williams H, Zobel J. Compressing integers for fast file access. The Computer Journal 1999; 42(3):193–201, doi:10.1093/comjnl/42.3.193.
 [37] Thiel L, Heaps H. Program design for retrospective searches on large data bases. Information Storage and Retrieval 1972; 8(1):1–20.
 [38] Williams R. An extremely fast Ziv-Lempel data compression algorithm. Proceedings of the 1st Data Compression Conference, DCC ’91, 1991; 362–371.
 [39] Goldstein J, Ramakrishnan R, Shaft U. Compressing relations and indexes. Proceedings of the Fourteenth International Conference on Data Engineering, ICDE ’98, IEEE Computer Society: Washington, DC, USA, 1998; 370–379.
 [40] Ng WK, Ravishankar CV. Block-oriented compression techniques for large statistical databases. IEEE Transactions on Knowledge and Data Engineering Mar 1997; 9(2):314–328, doi:10.1109/69.591455.
 [41] Delbru R, Campinas S, Tummarello G. Searching web data: An entity retrieval and high-performance indexing model. Web Semantics Jan 2012; 10:33–58, doi:10.1016/j.websem.2011.04.004.
 [42] Deveaux JP, Rau-Chaplin A, Zeh N. Adaptive Tuple Differential Coding. Database and Expert Systems Applications, Lecture Notes in Computer Science, vol. 4653. Springer Berlin / Heidelberg, 2007; 109–119, doi:10.1007/978-3-540-74469-6_12.
 [43] Ao N, Zhang F, Wu D, Stones DS, Wang G, Liu X, Liu J, Lin S. Efficient parallel lists intersection and index compression algorithms using graphics processing units. Proceedings of the VLDB Endowment May 2011; 4(8):470–481.
 [44] Baeza-Yates R, Jonassen S. Modeling static caching in web search engines. Advances in Information Retrieval, Lecture Notes in Computer Science, vol. 7224. Springer Berlin / Heidelberg, 2012; 436–446, doi:10.1007/978-3-642-28997-2_37.
 [45] Jonassen S, Bratsberg S. Intra-query concurrent pipelined processing for distributed full-text retrieval. Advances in Information Retrieval, Lecture Notes in Computer Science, vol. 7224. Springer Berlin / Heidelberg, 2012; 413–425, doi:10.1007/978-3-642-28997-2_35.
 [46] Jones DM. The New C Standard: A Cultural and Economic Commentary. Addison Wesley Longman Publishing Co., Inc.: Redwood City, CA, USA, 2003.
 [47] Willhalm T, Popovici N, Boshmaf Y, Plattner H, Zeier A, Schaffner J. SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units. Proceedings of the VLDB Endowment Aug 2009; 2(1):385–394.
 [48] Zhou J, Ross KA. Implementing database operations using SIMD instructions. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD ’02, ACM: New York, NY, USA, 2002; 145–156, doi:10.1145/564691.564709.
 [49] Inoue H, Moriyama T, Komatsu H, Nakatani T. A high-performance sorting algorithm for multicore single-instruction multiple-data processors. Software: Practice and Experience Jun 2012; 42(6):753–777, doi:10.1002/spe.1102.
 [50] Wassenberg J. Lossless asymmetric single instruction multiple data codec. Software: Practice and Experience 2012; 42(9):1095–1106, doi:10.1002/spe.1109.
 [51] Boytsov L. Clueweb09 posting list data set. http://boytsov.info/datasets/clueweb09gap/ [Last checked August 2012.] 2012.
 [52] Brenes DJ, Gayo-Avello D. Stratified analysis of AOL query log. Information Sciences May 2009; 179(12):1844–1858, doi:10.1016/j.ins.2009.01.027.
 [53] Pass G, Chowdhury A, Torgeson C. A picture of search. Proceedings of the 1st International Conference on Scalable Information Systems, InfoScale ’06, ACM: New York, NY, USA, 2006, doi:10.1145/1146847.1146848.
 [54] Fusco F, Stoecklin MP, Vlachos M. NET-FLi: On-the-fly compression, archiving and indexing of streaming network traffic. Proceedings of the VLDB Endowment 2010; 3:1382–1393.
 [55] Baeza-Yates R, Salinger A. Experimental analysis of a fast intersection algorithm for sorted sequences. Proceedings of the 12th international conference on String Processing and Information Retrieval, SPIRE’05, Springer-Verlag: Berlin, Heidelberg, 2005; 13–24.
 [56] Ding B, König AC. Fast set intersection in memory. Proceedings of the VLDB Endowment 2011; 4(4):255–266.
 [57] Vigna S. Quasi-succinct indices. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM ’13, ACM: New York, NY, USA, 2013.
 [58] Schlegel B, Willhalm T, Lehner W. Fast sorted-set intersection using SIMD instructions. ADMS Workshop, 2011.
 [59] Aksyonoff A. Introduction to Search with Sphinx: From installation to relevance tuning. O’Reilly Media, 2011.
Appendix A Information-theoretic bound on binary packing
Consider arrays of $n$ distinct sorted 32-bit integers. We can compress the deltas computed from such arrays using binary packing as described in § 2.6 (see Fig. 3). We want to prove that such an approach is reasonably efficient.
There are $\binom{2^{32}}{n}$ such arrays. Thus, by an information-theoretic argument, we need at least $\log_2 \binom{2^{32}}{n}$ bits to represent them. By a well-known inequality, we have that $\log_2 \binom{2^{32}}{n} \geq n \log_2 \frac{2^{32}}{n}$. In effect, this means that we need at least $\log_2 \frac{2^{32}}{n}$ bits/int.
Consider binary packing over blocks of $B$ integers: e.g., for BP32 we have $B = 32$ and for SIMDBP128 we have $B = 128$. For simplicity, assume that the array length $n$ is divisible by $B$ and that $B$ is divisible by 32. Though our result also holds for vectorized differential coding (§ 3), assume that we use the common version of differential coding before applying binary packing. That is, if the original array is $x_1, x_2, x_3, \ldots$ ($x_i > x_{i-1}$ for all $i > 1$), we compress the integers $x_1, x_2 - x_1, x_3 - x_2, \ldots$ using binary packing.
For every block of $B$ integers, we have an overhead of 8 bits to store the bit width $b$. This contributes $8n/B$ bits to the total storage cost. The storage of any given block depends also on the bit width for this block. In turn, the bit width is bounded by the logarithm of the difference between the largest and the smallest element in the block. If we write this difference for block $k$ as $\Delta_k$, the total storage cost in bits is
$$\frac{8n}{B} + \sum_{k=1}^{n/B} B \left\lceil \log_2 \Delta_k \right\rceil.$$
Because $\sum_{k=1}^{n/B} \Delta_k \leq 2^{32}$, we can show that the cost is maximized when $\Delta_k = 2^{32} B / n$. Thus, we have that the total cost in bits is smaller than
$$\frac{8n}{B} + \sum_{k=1}^{n/B} B \left\lceil \log_2 \frac{2^{32} B}{n} \right\rceil = \frac{8n}{B} + n \left\lceil \log_2 \frac{2^{32} B}{n} \right\rceil,$$
which is equivalent to $\frac{8}{B} + \left\lceil \log_2 \frac{2^{32} B}{n} \right\rceil$ bits/int. Hence, in the worst case, binary packing is suboptimal by $\frac{8}{B} + \left\lceil \log_2 \frac{2^{32} B}{n} \right\rceil - \log_2 \frac{2^{32}}{n} \leq \frac{8}{B} + 1 + \log_2 B$ bits/int. Therefore, we can show that BP32 is 2-optimal for arrays of length less than $2^{25}$ integers: its storage cost is no more than twice the information-theoretic limit. We also have that SIMDBP128 is 2-optimal for arrays of length $2^{23}$ or less.
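The two thresholds can be checked numerically with a short Python sketch. It rests on assumptions that should be read as such: an 8-bit header per block and a worst-case excess of $8/B + 1 + \log_2 B$ bits/int over the information-theoretic limit of $\log_2(2^{32}/n)$ bits/int. Under these assumptions, binary packing is 2-optimal whenever the excess does not exceed the limit itself, that is, whenever $\log_2 n \leq 32 - (8/B + 1 + \log_2 B)$. The function names are hypothetical.

```python
from math import floor, log2


def worst_case_excess(block, header_bits=8):
    """Assumed worst-case excess (bits/int) of binary packing over the lower bound."""
    return header_bits / block + 1 + log2(block)


def max_log2_length(block, word_bits=32):
    """Largest log2(n) for which the excess stays below log2(2^word_bits / n)."""
    return word_bits - worst_case_excess(block)


bp32_excess = worst_case_excess(32)
bp128_excess = worst_case_excess(128)
bp32_limit = floor(max_log2_length(32))     # power-of-two length threshold for BP32
bp128_limit = floor(max_log2_length(128))   # power-of-two length threshold for SIMDBP128
print(bp32_excess, bp128_excess, bp32_limit, bp128_limit)
```

The resulting exponents agree with the array lengths quoted for BP32 and SIMDBP128.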