External-Memory Multimaps
Abstract
Many data structures support dictionaries, also known as maps or associative arrays, which store and manage a set of key-value pairs. A multimap is a generalization that allows multiple values to be associated with the same key. For example, the inverted file data structure that is used prevalently in the infrastructure supporting search engines is a type of multimap, where words are used as keys and document pointers are used as values. We study the multimap abstract data type and how it can be implemented efficiently online in external memory frameworks, with constant expected I/O performance. The key technique used to achieve our results is a combination of cuckoo hashing, using buckets that hold multiple items, with a multiqueue implementation to cope with varying numbers of values per key. Our external-memory results are for the standard two-level memory model.
1 Introduction
A multimap is a simple abstract data type (ADT) that generalizes the map ADT to support key-value associations in a way that allows multiple values to be associated with the same key. Specifically, it is a dynamic container of key-value pairs, which we call items, supporting (at least) the following operations:

insert(k, v): insert the key-value pair, $(k, v)$. This operation allows for there to be existing key-value pairs having the same key $k$, but we assume w.l.o.g. that the particular key-value pair $(k, v)$ is itself not already present in the container.

isMember(k, v): return true if the key-value pair, $(k, v)$, is present in the container.

remove(k, v): remove the key-value pair, $(k, v)$, from the container. This operation returns an error condition if $(k, v)$ is not currently in the container.

findAll(k): return the set of all key-value pairs in the container having key equal to $k$.

removeAll(k): remove from the container all key-value pairs having key equal to $k$.

count(k): return the number of values associated with key $k$.
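As a point of reference, the six operations can be modelled in memory as follows. This is only a minimal sketch of the ADT's semantics, not the external-memory structure studied in this paper; the class name `Multimap` and the Python method names are illustrative assumptions, and no I/O accounting is attempted.

```python
from collections import defaultdict


class Multimap:
    """In-memory reference model of the multimap ADT (no I/O accounting)."""

    def __init__(self):
        self._values = defaultdict(set)  # key -> set of values

    def insert(self, k, v):
        self._values[k].add(v)

    def is_member(self, k, v):
        return v in self._values.get(k, ())

    def remove(self, k, v):
        if not self.is_member(k, v):
            raise KeyError((k, v))  # error condition: pair not present
        self._values[k].discard(v)
        if not self._values[k]:
            del self._values[k]  # drop keys that no longer have values

    def find_all(self, k):
        return {(k, v) for v in self._values.get(k, ())}

    def remove_all(self, k):
        self._values.pop(k, None)

    def count(self, k):
        return len(self._values.get(k, ()))
```

In an inverted-file setting, `k` would be a word and `v` a document pointer, so `find_all(k)` lists every recorded occurrence of the word.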
Surprisingly, we are not familiar with any previous discussion of this abstract data type in the theoretical algorithms and data structures literature. Nevertheless, abstract data types equivalent to the above ADT, as well as multimap implementations, are included in the C++ Standard Template Library (STL) [17], Guava, the Google Java Collections Library (http://code.google.com/p/googlecollections/), and the Apache Commons Collection 3.2.1 API (http://commons.apache.org/collections/apidocs/index.html). Clearly, the existence of these implementations provides empirical evidence for the usefulness of this abstract data type.
1.1 Motivation
One of the primary motivations for studying the multimap ADT is that associative data in the real world can exhibit significant non-uniformities with respect to the relationships between keys and values. For example, many real-world data sets follow a power law with respect to data frequencies indexed by rank. The classic description of this law is that in a corpus of natural language documents, defined with respect to $n$ words, the frequency, $f(r)$, of the word of rank $r$ is predicted to be
$$f(r) = \frac{1}{r^{s} H_{n,s}},$$
where $s$ is a parameter characterizing the distribution and $H_{n,s}$ is the $n$th generalized harmonic number. Thus, if we wished to construct a data structure that can be used to retrieve all instances of any query word, $w$, in such a corpus, subject to insertions and deletions of documents, then we could use a multimap, but would require one that could handle large skews in the number of values per key. In this case, the multimap could be viewed as providing a dynamic functionality for a classic static data structure, known as an inverted file or inverted index (e.g., see Knuth [11]). Given a collection of documents, an inverted file is an indexing strategy that allows one to list, for any word $w$, all the places in the collection where $w$ appears.
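To make the skew concrete, the following sketch evaluates the power law numerically, assuming the standard Zipf form in which the frequency of the rank-$r$ word is $1/(r^{s} H_{n,s})$. The function names and the example parameters are illustrative assumptions, not taken from the paper.

```python
def generalized_harmonic(n, s):
    """H_{n,s} = sum_{r=1}^{n} 1 / r^s."""
    return sum(1.0 / r**s for r in range(1, n + 1))


def zipf_frequency(r, n, s):
    """Predicted relative frequency of the word of rank r among n words."""
    return 1.0 / (r**s * generalized_harmonic(n, s))


if __name__ == "__main__":
    n, s = 1000, 1.0
    top = zipf_frequency(1, n, s)
    bottom = zipf_frequency(n, n, s)
    # With s = 1 the most frequent word is n times as frequent as the rarest.
    print(f"rank 1: {top:.4f}, rank {n}: {bottom:.6f}, ratio: {top / bottom:.0f}")
```

Under this model the frequencies sum to one, and a key of rank 1 carries $n^{s}$ times as many values as the key of rank $n$, which is the kind of imbalance a multimap must absorb.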
Another powerful motivation for studying multimaps is graphical data [3]. A multimap can represent a graph: keys correspond to nodes, values correspond to neighbors, findAll operations list all neighbors of a node, and removeAll operations delete a node from the graph. The degree distribution of many real-life graphs follows a power law, motivating efficient handling of non-uniformity.
As a more recent example, static multimaps were used for a geometric hashing implementation on graphical processing units in [2]. In this setting, signatures are computed from an image, and a signature can appear multiple times in an image. The signature is a key, and the values correspond to locations where the signature can be found. Geometric hashing allows one to find query images within reference images. Dynamic multimaps could allow for changes in reference images to be handled dynamically without recalculating the entire structure.
There are countless other possible scenarios where we expect multimaps can prove useful. In many settings, one can indicate the intensity of an event or object by a score. Examples include the apparent brightness of stars (measured by stellar magnitudes), the intensity of earthquakes (measured on the Richter scale), and the loudness of sounds (measured on the decibel scale). Necessarily, when data from such scoring frameworks is labelled as key-value pairs where the numeric score is the key, some scores will have disproportionately more associated values than others. In fact, in assigning numeric scores to observed phenomena, there is a natural tendency for human observers to assign scores that depend logarithmically on the stimuli. This perceptual pattern is so common it is known as the Weber–Fechner Law [7, 10]. Multimaps may prove particularly effective for such data sets.
1.2 Previous Related Work
Inverted files have standard applications in text indexing (e.g., see Knuth [11]), and are important data structures for modern search engines (e.g., see Zobel and Moffat [23]). Typically, this is a static structure and the collection is usually thought of as all the documents on the Internet. Thus, an inverted file is a static multimap that supports the findAll($w$) operation (typically with a cutoff for the most relevant documents containing the query word $w$).
Cutting and Pedersen [6] describe an inverted file implementation that uses B-trees for the indexing structure and supports incremental and batched insertions, but it does not support deletions efficiently. More recently, Luk and Lam [14] describe an in-memory inverted file implementation based on hash tables with chaining, but their method also does not support fast deletions. Likewise, Lester et al. [12, 13] and Büttcher et al. [5] describe out-of-core inverted file implementations that support insertions only. Büttcher and Clarke [4], on the other hand, consider the trade-offs for allowing for both insertions and deletions in an inverted file, and Guo et al. [9] describe a solution for performing such operations by using a type of B-tree.
Our work utilizes a variation on cuckoo hash tables. We assume the reader has some familiarity with such hash tables, as originally presented by Pagh and Rodler [18] (a general description can be found on Wikipedia at http://en.wikipedia.org/wiki/Cuckoo_hashing). We describe the relevant background in Section 2.
Finally, recent work by Verbin and Zhang [21] shows that in the external memory model, for any dynamic dictionary data structure with query cost $O(1)$, the expected amortized cost of updates must be at least 1. As explained below, this implies our data structure is optimal up to constant factors.
1.3 Our Results
In this paper we describe efficient external-memory implementations of the multimap ADT. Our external-memory algorithms are for the standard two-level I/O model, which captures the memory hierarchy of modern computer architectures (e.g., see [1, 22]). In this model, there is a cache of size $M$ connected to a disk of unbounded size, and the cache and disk are divided into blocks, where each block can store up to $B$ items. Any algorithm can only operate on cached data, and algorithms must therefore make memory transfer operations, which read a block from disk into cache or vice versa. The cost of an algorithm is the number of I/Os required, with all other operations considered free. All of our time bounds hold even when $M = O(B)$, and we therefore omit reference to $M$ throughout.
We support an online implementation of the multimap abstract data type, where each operation must completely finish executing (either in terms of its data structure updates or query reporting) prior to our beginning execution of any subsequent operations. The bounds we achieve for the multimap ADT methods are shown in Table 1. All bounds are unamortized.
Method  I/O Performance 

insert  
isMember  
remove  
findAll  
removeAll  
count 
Our constructions are based on the combination of two external-memory data structures, external-memory cuckoo hash tables and multiqueues, which may be of independent interest. We show that external-memory cuckoo hashing supports a cuckoo-type method for insertions that provably requires only an expected constant number of I/Os. We then show that this performance can be combined with expected constant I/O complexity for multiqueues to design a multimap implementation that has constant (unamortized) worst-case or expected I/O performance for most methods. Our methods imply that one can maintain an inverted file in external memory so as to support a constant expected number of I/Os for insertions and worst-case constant I/Os for lookups and item removal.
2 External-Memory Cuckoo Hashing
In this section, we describe external-memory versions of cuckoo hash tables with multiple items per bucket. The implementation we describe in this section is for the map ADT, where all key-value pairs are distinct. We show later in this paper how this approach can be used in concert with multiqueues to support multiple key-value pairs with the same key for the multimap ADT.
Cuckoo hash tables that can store multiple items per bucket have been studied previously, having been introduced in [8]. Generally the analysis has been limited to buckets of constant size, where size is measured in terms of the number of items, and an item in this context is a key-value pair in our collection. For our external-memory cuckoo hash table, each bucket can store $B$ items, where $B$ is a parameter defining our block size and is not necessarily a constant.
Formally, let be a cuckoo hash table such that each consists of buckets, where each bucket stores a block of size , with . (In the original cuckoo hash table setting, = 1.) One setting of particular interest is when for some (small) , so that space overhead of the hash table is only an factor over the minimum possible. The items in are indexed by keys and stored in one of two locations, or , where and are random hash functions. (The assumption that the hash functions are random can be done away with using suitable realistic hash functions; see for example [8] for a discussion, or [16] for an alternative model.)
It should be clarified that, in some settings, the use of a cuckoo hash function may be unnecessary or even unwarranted. Indeed, if for a suitable constant and , we can use simple hash tables, with just one choice for each item, instead. In this case, with Chernoff and union bounds one can show that with high probability all buckets will fit all the items that hash to it, since the expected number of items per bucket will then be , and is large enough for strong tail bounds to hold. Cuckoo hashing here allows us to avoid such “wide block assumptions”, giving a more general approach. In practice, also, across the full range of possible values for we expect cuckoo hashing to be much more space efficient. Whether this space savings is important may depend on the setting.
The important feature of the cuckoo hashing implementation is the way it may reallocate items in the table during an insertion. Standard cuckoo hashing, with one item per bucket, immediately evicts the previous (and only) item in a bucket when a new item is to be inserted in an occupied bucket. With multiple items per bucket, there is a choice available. We describe what is known in this setting, and how we modify it for our use here.
Let $G$ be the cuckoo graph, where each bucket in the table is a vertex and, for each item $x$ currently in the table, we connect the two candidate buckets of $x$ by a directed edge, with the edge pointing toward the bucket in which $x$ is not currently stored. Suppose we wish to insert an item $x$ into a bucket $X$ of the table. If $X$ contains fewer than $B$ items, then we simply add $x$ to $X$. Otherwise, we need to make room for the new item.
One approach for doing an insertion is to use a breadth first search on the cuckoo graph. The results of Dietzfelbinger and Weidling show that for sufficiently large constant , the expected insertion time is constant. Specifically, when and , the expected time to insert a new key is , which is a constant. (This may require rehashing all items in very rare cases when an item cannot be placed; the expected time remains constant.) Notice that if grows in a fashion that is , then a breadth first search approach does not naturally take constant expected time, as even the time to look if items currently in the bucket can be moved will take time. (It might still take constant expected time – it may be that only a constant number of buckets need to be inspected on average – but it does not appear to follow from [8].)
For nonconstant , we can apply the following mechanism: we can use our buckets to mimic having distinct subtables for some large constant , where the th subtable uses the th fraction of the bucket space, and each item is hashed into a specific subtable. For for , each subtable will contain close to its expected number of items with high probability. Further, by choosing suitably large one can ensure that each subtable is within a factor of its total space while maintaining an expected insertion time. Specifically, we have the following theorem:
Theorem 1.
Suppose for a cuckoo hash table the block size satisfies and for . Let be arbitrary, let be a collection of items, and let be a table with at least blocks. Suppose further we have subtables, with with each item hashed to a subtable by a fully random hash function, and the hash functions for each subtable are fully random hash functions. Finally, suppose the items of have been stored in by an algorithm using the partitioning process described above and the cuckoo hashing process. Then the expected time for the insertion of a new item using a BFS is .
Proof.
Each subtable has the capacity to hold items, and will receive an expected items to store. Let be the number of items in the first subtable. A standard Chernoff bound (e.g., [15][Theorem 4.4]) gives that is at most with probability bounded by
With for , we see that all subtables have at most with probability subexponential in . By keeping counters for each subtable, we can rehash the items of all subtables in the rare case where a subtable exceeds this number of items without affecting the expected insertion time by more than an term.
The proof follows from Theorem 2 of [8], by noting that each subtable has space for at least items. (In rare cases where an insertion fails, we can reinsert all items in a subtable without affecting the expected insertion time by more than an term.) ∎
It is likely this result could be improved (see the remarks in [8]), but it is sufficient for our purposes of showing that there is an insertion method for external-memory cuckoo tables that uses a constant expected number of I/Os.
As noted in [8], a more practical approach is to use random walk cuckoo hashing in place of breadth first search cuckoo hashing. (For example, random walk cuckoo hashing is used in all experiments in [8].) With random walk cuckoo hashing, when an item cannot be placed, it kicks out a single item in the bucket chosen uniformly at random. Random walk cuckoo hashing avoids the potentially large rare memory overhead required of breadth first search, allowing instead a nearly stateless solution.
More specifically, suppose a bucket $X_0$ is full when placing an item $x$. To reallocate items, we perform a random walk on the buckets, starting from $X_0$, to find an augmenting path that has the net effect of freeing up a location in $X_0$ (for $x$) while maintaining the two-choice allocation rule for all the existing items in the table. Let $X$ denote the current node we are visiting in our random walk (which is associated with a full bucket in the external-memory cuckoo table; initially, the bucket $X_0$). To identify the next node to visit, we choose one of the items, $y$, in $X$, uniformly at random. We then remove $y$ from $X$ and insert the item waiting to be inserted in $X$. We then let $y$ take over the role of $x$, and attempt to place $y$ in the other bucket that is a possible location for this item. We repeat this process until we find a non-full bucket or reach a predefined stopping condition.
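The random walk just described can be sketched in memory as follows. This is an illustrative toy, not the external-memory implementation: Python lists stand in for disk blocks, two salted calls to the built-in `hash` stand in for the two random hash functions, and the class name, the `max_steps` stopping condition, and the choice to try both candidate buckets before starting the walk are all assumptions made for the example.

```python
import random


class BucketedCuckooTable:
    """Two-choice cuckoo table with multi-item buckets and random-walk eviction."""

    def __init__(self, num_buckets, bucket_size, max_steps=500, seed=0):
        self.buckets = [[] for _ in range(num_buckets)]
        self.bucket_size = bucket_size
        self.max_steps = max_steps  # hypothetical stopping condition
        self.rng = random.Random(seed)
        # Two salts stand in for the two random hash functions.
        self.salts = (self.rng.getrandbits(64), self.rng.getrandbits(64))

    def choices(self, key):
        n = len(self.buckets)
        return tuple(hash((salt, key)) % n for salt in self.salts)

    def lookup(self, key):
        # At most two buckets (blocks) are inspected.
        return any(key in self.buckets[b] for b in self.choices(key))

    def insert(self, key):
        """Return None on success, or the item left unplaced if the walk gives up."""
        if self.lookup(key):
            return None
        for b in self.choices(key):
            if len(self.buckets[b]) < self.bucket_size:
                self.buckets[b].append(key)
                return None
        item, current = key, self.choices(key)[0]
        for _ in range(self.max_steps):
            bucket = self.buckets[current]
            i = self.rng.randrange(len(bucket))
            item, bucket[i] = bucket[i], item  # evict a uniformly random victim
            a, b = self.choices(item)
            current = b if current == a else a  # the victim's other bucket
            if len(self.buckets[current]) < self.bucket_size:
                self.buckets[current].append(item)
                return None
        return item  # a full implementation would rehash here
```

Each step of the walk touches one bucket, so in an external-memory setting the number of loop iterations corresponds to the number of blocks visited by the insertion.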
For loads arbitrarily close to one, it is not known if there is a random walk cuckoo hashing scheme using two bucket choices and multiple items per bucket that similarly achieves expected constant insertion time and logarithmic insertion time with high probability. (This is given as an open question in [8].) Sadly, we do not resolve this question here.
However, for loads up to about we can utilize results by Panigrahy [19, 20] to obtain such a random walk cuckoo hashing scheme. In Theorem 2.3.2 of [20], he shows that for hash tables for items and load factors of satisfying , when the bucket size is 2, random walk cuckoo hashing will succeed in inserting an item with a path of length with probability ; his argument also shows that this process has expected constant insertion time. This allows loads up to (approximately) using our partitioning technique above. In practice, we might expect this load to be improved significantly in various ways. First, we might ignore the partitioning, and instead perform the random walk directly on the buckets with load . Analyzing this process is difficult, in part because of the greatly increased possibility of cycles in the cuckoo graph. Alternatively, we could perform the partitioning but allow the random walk to stop early if there is room in the block , rather than the bucket for the corresponding subtable, effectively multiplexing the bucket over subtable instantiations. We consider these multiple variations in the simulations of Section 5.
Finally, we point out that, as in a standard cuckoo hash table, item lookups and removals use a worst-case constant number of I/Os.
3 External-Memory Multimaps
In this section, we describe an extension of the external-memory cuckoo hash table (as described in Section 2) that can be used to maintain a multimap in external memory, so as to support fast dynamic access of a massive data set of key-value pairs where some keys may have many associated values.
3.1 The Primary Structure
To implement the multimap ADT, we begin with a primary structure that is an external-memory cuckoo hash table storing just the set of keys. In particular, each record in this table is associated with a specific key, $k$, and holds the following fields:

the key, $k$, itself

the number, $n_k$, of key-value pairs in the multimap with key equal to $k$

a pointer, , to a block in a secondary table, , that stores items in with key equal to . Let denote the number of keyvalue pairs in with key equal to . If , then stores all the items with key equal to (plus possibly some items with keys not equal to ). Otherwise, if , then points to a first block of items with key equal to , with the other blocks of such items being stored elsewhere in .
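The three fields of a primary-structure record can be pictured with the following sketch. The names `KeyRecord`, `count`, and `first_block` are hypothetical, an integer block index stands in for a disk pointer, and a Python `dict` stands in for the external-memory cuckoo hash table.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Optional


@dataclass
class KeyRecord:
    key: Hashable
    count: int = 0                     # number of key-value pairs with this key
    first_block: Optional[int] = None  # hypothetical block index in the secondary table


def count(primary: Dict[Hashable, KeyRecord], k: Hashable) -> int:
    """Answer count(k) from the header record alone, without touching value blocks."""
    record = primary.get(k)
    return record.count if record is not None else 0
```

Because the number of pairs is stored alongside the key, a count query in this sketch needs only the lookup in the primary table.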
This secondary storage is an external-memory data structure we are calling a multiqueue.
3.2 An External-Memory Location-Aware Multiqueue
3.2.1 Overview
The secondary storage that we need in our construction is a way to maintain a set of queues in external memory. We assume the header pointers for these queues are stored in an array, which in our external-memory multimap construction is the external-memory cuckoo hash table described above.
For any queue, $Q$, we wish to support the following operations:

enqueue(x, H): add the element $x$ to $Q$, given a pointer to its header, $H$.

remove(x): remove $x$ from $Q$. We assume in this case that each $x$ is unique.

isMember(x): determine whether $x$ is in some queue, $Q$.
In addition, we wish to maintain all these queues in a space-efficient manner, so that the total storage is proportional to their total size. To enable this, we store all the blocks used for queue elements in a secondary table of blocks of size $B$ each. Thus, each header record points to a block in this secondary table.
Our intent is to store each queue as a doubly-linked list of blocks from the secondary table. Unfortunately, some queues are too small to deserve an entire block dedicated to storing their elements. So small queues must share their first block of storage with other small queues until they are large enough to deserve an entire block of storage dedicated to their elements. Initially, all queues are assumed to be empty; hence, we initially mark each queue as being light. In addition, the blocks in the secondary table are initially empty; hence, we link these blocks in a consecutive fashion as a doubly-linked list and identify this list as being the free list for the secondary table.
We set a heavy-size threshold at elements. When a queue $Q$ stored in a block reaches this size, we allocate a block from the secondary table (taking a block off the free list) exclusively to store elements of $Q$ and we mark $Q$ as heavy. Likewise, to avoid wasting space as elements are removed from a queue, we require any heavy queue to have at least elements. If a heavy queue's size falls below this lower threshold, then we mark the queue as being light again and we force it to go back to sharing its space with other small queues. This may in turn involve returning a block to the free list. In this way, each block in the secondary table will either be empty or will have all its elements belonging to a single heavy queue or as many as light queues. In addition, these rules also imply that element insertions are required to take a queue from the light state to the heavy state and element removals are required to take a queue from the heavy state to the light state.
If a block is being used for light queues, then we order its elements according to their respective queues. Each block for a heavy queue $Q$ stores previous and next pointers to the neighboring blocks in the linked list of blocks for $Q$, with the first such block pointing back to the header record for $Q$. As we show, this organization allows us to maintain our size and label invariants during execution of enqueue and remove operations.
One additional challenge is that we want to support the remove($x$) operation with constant I/O complexity. Thus, we cannot afford to search through a list of blocks of a queue looking for an element we wish to remove. So, in addition to the secondary table and its free list, and the headers for each queue, we also maintain an external-memory cuckoo hash table that serves as a dictionary mapping each queue element $x$ to the block in the secondary table that stores $x$. This allows our multiqueue to be location-aware, that is, to support fast searches to locate the block that is holding any element $x$ that belongs to some queue, $Q$.
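The role of the element-to-block dictionary can be illustrated with a small in-memory sketch. The class and method names are hypothetical, Python dicts and lists stand in for the cuckoo dictionary and the disk blocks, and queue headers, free lists, and size thresholds are deliberately left out.

```python
class LocationAwareStore:
    """Toy stand-in for the block table plus the element-to-block dictionary."""

    def __init__(self):
        self.blocks = {}    # block id -> list of elements (stand-in for disk blocks)
        self.location = {}  # element -> block id (stand-in for the cuckoo dictionary)

    def add(self, block_id, x):
        self.blocks.setdefault(block_id, []).append(x)
        self.location[x] = block_id

    def is_member(self, x):
        return x in self.location

    def remove(self, x):
        block_id = self.location.pop(x)  # one dictionary lookup, no list traversal
        self.blocks[block_id].remove(x)  # work confined to a single block
        return block_id

    def move(self, x, dst_block_id):
        # Moving an element between blocks must also update its dictionary entry.
        src = self.location[x]
        self.blocks[src].remove(x)
        self.add(dst_block_id, x)
```

The `move` method hints at why splits and merges are comparatively expensive: every element that changes blocks also needs its dictionary entry updated.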
We will call any block in containing fewer than items deficient. In order to ensure that our multiqueue uses total storage proportional to its total size, we will enforce the following two rules. Together, these rules guarantee that there are deficient blocks in , and hence our multiqueue uses blocks of memory.

Each block in stores a pointer , called the deficient pointer, to a block ; the identity of this block is allowed to vary over time. We ensure that at all times, is the only (possibly) deficient block associated with that stores light queues.

Each heavy queue also stores in its header block a deficient pointer to a block . At all times, is the only (possibly) deficient block devoted to storing values for .
3.2.2 Full Description
For the remainder of this subsection, we describe how to implement all multiqueue operations to obtain constant amortized expected or worst-case runtime. We show how to deamortize these operations in Section 3.3.
The Split Action.
As we perform enqueue operations, a block may overflow its size bound, . In this case, we need to split in two, which we do by allocating a new block from (using its free list). We call the source of the split, and the sink of the split. We then proceed depending on whether contains elements from light queues or a single heavy queue.

contains elements from light queues. We greedily copy elements from into until has size has size at least , keeping the elements from the same light queue together. Note that each light queue has less than elements, so this split will result in at least a – balance.
Of course, to maintain our invariants, we must change the header records from to for any queues that we just moved to . We can achieve this by performing a lookup in for each key corresponding to a queue that was moved from to , and modifying its header record, which requires I/Os. Similarly, in order to support location awareness, we must also update the dictionary . So, for each element that is moved to , we look up in and update its pointer to now point to . In total this costs I/Os.

contains elements from a single heavy queue . In this case, we move no elements, and simply take a block from the free list and insert it as the head of the list of blocks devoted to , changing the header record in to point to . We also change the deficient pointer for to point to , and insert into the element that caused the split. This takes I/O operations in total.
So, to sum up, when a block holding light queues results from a split (source or sink), it has size at least and at most . When a block holding elements from a heavy queue is split, no items are moved and a block is taken from the free list and inserted as the new header block of the heavy queue; the new header then contains only one item, and is identified by the deficient pointer of .
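The greedy split of a block holding light queues can be sketched as follows. The target size of the sink is passed in as the parameter `min_sink_size`, which is an assumption of this example rather than a value taken from the text; the function name and the representation of a block as a list of `(queue_id, element)` pairs are likewise illustrative. Pointer updates in the header table and the location dictionary are only indicated by the returned set of moved queues.

```python
from itertools import groupby


def split_light_block(source, min_sink_size):
    """Split a block of light queues, keeping each queue's elements together.

    `source` is a list of (queue_id, element) pairs in which elements of the
    same queue are contiguous. Whole queues are moved to the sink until it
    holds at least `min_sink_size` elements.
    """
    groups = [list(g) for _, g in groupby(source, key=lambda pair: pair[0])]
    sink = []
    while groups and len(sink) < min_sink_size:
        sink.extend(groups.pop())  # move one whole light queue
    remaining = [pair for g in groups for pair in g]
    moved_queues = {queue_id for queue_id, _ in sink}  # headers needing updates
    return remaining, sink, moved_queues
```

Since every light queue is small, moving whole queues one at a time cannot overshoot the target by more than the size of a single light queue, which is what keeps the two resulting blocks balanced.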
The Enqueue Operation.
Given the above components, let us describe how we perform the enqueue and remove operations. We begin with the enqueue() operation. We consider how this operation acts, depending on a few cases.

The queue for is empty (hence, is a null pointer and its queue is light). In this case, we examine the block from to which belongs. If is null, we first take a block of the free list and set to before continuing. We follow the deficient pointer for to a block , and add to . If this causes the size of to reach , then we split as described above.

The queue for is not empty. We proceed according to two cases.

If is a light queue, we follow to its block in and add to . If this brings the size of above , we perform a light-to-heavy transition, taking a block off the free list, moving all elements in to , and marking as heavy. If this brings the size of below , we process as in the remove operation below.

If is a heavy queue, we add to , the (possibly) deficient block for . If this brings the size of to , then we split , as described above.

Once the element is added to a block in , we then add to the dictionary , and have its record point to .
The Remove and isMember Operations.
In both of these operations, we look up in to find the block in that contains . In the isMember() case, we complete the operation by simply looking for in . In the remove() operation, we do this look up and then remove from if we find . If this causes to become empty, then we update its header, , to be null. In addition, if this operation causes the size of to go below , then we need to do some additional work, based on the following cases:

is a heavy queue.

If is the only block for , then should now be considered a light queue; hence, we continue processing it according to the case listed below where contains only light queues. We refer to the entirety of this action as a heavy-to-light queue transition.

Otherwise, if , then we are done because is allowed to be deficient. If , we proceed based on the following two cases:

alteration action: If the size of is at least , we simply update ’s deficient pointer, , to point to instead of .

Merge action: If the size of is less than , then we move all of the elements of into and we update the pointer in for each moved element. is returned to the free list. We call the source of the merge, and the sink. (Note that in this case, the size of becomes at most .)



contains light queues (hence, no heavy queue elements). In this case, we visit the header for . Let denote the block containing .

If we are done, since is allowed to be deficient.

If , let be the size of .

alteration action: If then we simply update to point to instead of .

Merge action: If , then we merge the elements in into , which now has size at most , and update pointers in and for the elements that are moved. We return to the free list. We call the source of the merge and the sink.


If a block is pointed to by any deficient pointer , it is helpful to think of this as “protection” for from being the source of a merge. Once is afforded this protection, it will not lose it until its size is at least (see the alteration action). At a high level, this will allow us to argue that if and are respectively the source and sink of a merge action, neither nor will be the source of a subsequent merge or split operation until it is the target of enqueue or remove operations, even though may have size very close to the deficiency threshold .
3.2.3 Amortized I/O complexity
We now argue formally that enqueue and remove() take $O(1)$ amortized time. Notice that the only actions that result in the movement of items between blocks are light-to-heavy and heavy-to-light queue transitions, merge actions, and split actions for blocks containing light queues. Notice that for splits involving heavy queues, we perform $O(1)$ I/O operations in the worst case, and do not need to perform an amortized analysis.
We first argue that light-to-heavy queue transitions as well as heavy-to-light transitions contribute amortized I/Os to enqueue operations. Indeed, a light-to-heavy queue transition requires I/Os in total: we require I/Os to move items from to and update pointers in and , and additional I/Os to process as in a remove operation if this causes the size of to fall below . Each such light-to-heavy transition must be preceded by at least enqueue operations to bring the queue from size at most to size at least , so we can charge these I/Os to these enqueue operations. These enqueue operations will never be charged again. Similarly, a heavy-to-light queue transition requires I/Os, which we can charge to the (at least) removals that caused 's size to fall from to ; these removals will never be charged again.
Since we have accounted for the I/Os caused by light-to-heavy and heavy-to-light queue transitions, we may ignore all I/Os caused by these transitions through the remainder of the argument. We now argue that merge and split actions contribute amortized I/Os as well, beginning with merge actions.
Suppose and are respectively the source and sink of a merge action. We claim that neither nor will be the source of a subsequent merge or split operation until it is the target of enqueue or remove operations. Indeed, notice that after doing a merge action as a part of our processing of a remove operation, the sink will contain at most elements and will be equal to or , and the source is on the free list. As and are protected from merges, it would take at least enqueues or removals in these blocks before they would be sources of another split or merge operation.
Likewise, after performing a split of a block containing light queues as a part of an enqueue operation, both source and sink will be of size at least and at most . Thus, it would take at least enqueues or removals in these blocks before they would be sources of another split or merge operation.
Therefore, in an amortized analysis, we can charge the I/Os performed in a split or merge action to the previous operations that caused one of these blocks to shrink to size or grow to size . These enqueues and removals will never be charged again.
The arguments of the last two paragraphs are depicted graphically in Figure 1. Assuming no light-to-heavy or heavy-to-light transitions take place (we may assume this because we have separately accounted for the I/O cost of these transitions), we depict a subgraph of the state diagram for any block . Specifically, we depict all state transitions caused by any action that results in the movement of items from one block to another; for brevity, we omit the effects of any actions that do not result in the movement of items. We refer to any state corresponding to a source of a merge or split action as a "source state." It is clear that in the subgraph depicted in Figure 1, there is no directed path from any non-source state to any source state. Given this fact, it is a straightforward exercise to confirm that the only paths from non-source states to source states in the full state diagram (assuming no light-to-heavy or heavy-to-light transitions) include at least enqueue or remove operations to .
3.3 Deamortizing Multiqueue Operations
We now explain how to deamortize the multiqueue operations of the previous section. First, notice that the only actions that result in the movement of items between blocks are merge actions, split actions for blocks devoted to heavy queues, light-to-heavy queue transitions, and heavy-to-light queue transitions. We will require the following property: for any action resulting in the movement of items from source block to sink block , neither nor will be the source of any subsequent action requiring the movement of items until it is the target of at least enqueue or remove operations.
First, we describe some modifications to the lighttoheavy and heavytolight queue transitions that are necessary to ensure this property is satisfied. We begin with lighttoheavy transitions. Previously, as soon as a light queue grew to size , it was moved from its block to a block devoted exclusively to ; this could cause the size of to fall close to or below , and could therefore be the source of a merge shortly after (or immediately upon) the lighttoheavy transition. Because this clearly does not satisfy the required property, we will do away with an explicit lighttoheavy transition action, and instead fold this functionality into the split action as follows.
We leave unmodified the split action for blocks devoted to heavy queues, as well as for blocks containing only light queues in which none of the queues have size greater than . It is easy to see in both of these cases that the required property is satisfied, as in the first case (split for blocks devoted to heavy queues) no items are moved, and in the second case both the source and sink of the split have size between and .
However, if the source of the split move contains a queue of size at least , we proceed according to the following cases.

contains a queue of size between and . We take a new block off the free list and move all items in to , marking as heavy and updating the affected pointers in and . After this split action, both and have size between and , and hence neither will be the source of a split action or merge action until it is the target of at least enqueue or remove operations.

contains a queue of size greater than . Let denote the items in that are not in . We proceed according to the following cases.

If has size less than , we leave in and mark it as heavy. In addition, we transfer all items in to , and update all affected pointers in and . After the split, is devoted to and has size at least . now has size at most , and moreover and thus is protected from being the source of a merge. It therefore requires at least inserts or removals to or before either can be the source of any action requiring the movement of items between blocks.

If has size greater than , we leave in and mark it as heavy. We take a new block off the free list and transfer all items in to . We update all affected pointers in and , and modify the deficient pointer of to point to . The source block is devoted to and has size at least . has size , and moreover and thus is protected from being the source of a merge. It therefore requires at least inserts or removals to or before either can be the source of any subsequent action requiring the movement of items between blocks.

Let us now explain a small modification we must make to the heavytolight transitions in order to satisfy the required property. Observe that it is possible for a queue to undergo a heavytolight transition shortly after the final two blocks and devoted to are merged into one. For example, it is possible that contains one item before the merge and items after the merge; if one item is subsequently removed from , will undergo a heavytolight transition, and our required property will not be satisfied. This is the only setting in which a deficient pointer fails to “protect” a block from being merged. To circumvent this difficulty, we modify the heavytolight queue transition to only occur when the size of the heavy queue falls below rather than . With this in hand, the arguments of Section 3.2.3 suffice to show that any merge action or heavytolight transition satisfies our required property. This completes the description of all modifications necessary to ensure the required property is satisfied by all actions.
We now explain how to deamortize the operations of Section 3.2, which all required amortized time. The only actions requiring I/O operations in Section 3.2 were split actions, merge actions, heavytolight transitions, and lighttoheavy transitions that caused elements from a source block to be moved to a sink block (the latter have now been replaced with a modified split operation). These actions required I/O operations to immediately update all affected pointers in and . To deamortize these operations, we immediately move the elements from to , but do not immediately update any pointers in and . Instead, we create a pointer from to , allowing us to spread out the updates to and over many operations as follows.
We will ensure that any block need point to at most one block at any time; specifically, any time a split action or merge action causes items to move from block to block , we will overwrite the old value of with the new value. To clarify, when a block is sent to the free list as a result of a merge operation, it must maintain its pointer throughout its time on the free list; it is only safe to overwrite when items are once again moved from to another block .
We will also ensure that no queue is ever moved more than once before its header in and the records for all of its keyvalue pairs in are brought uptodate. Given this fact, if we ever follow a pointer from or to a block , and the corresponding item is not in , we need only look in for the item as well.
To this end, we associate with each block in a bit array of length indicating which items in have up-to-date pointers in and . Any time items are moved into as a result of a split or merge action, we set the corresponding bits in the bit array of to 0, indicating these items are not up to date. Further, we modify the enqueue() and remove() operations such that if is stored in block , then we update the pointers in and of up to 12 items in and 12 items from that are not up to date. We then mark these items as up to date. This requires only I/O operations for each enqueue() or remove() function call.
We finally argue that each time items from a block are to be moved to a block , it is safe to overwrite with a pointer to . Indeed, we carefully argued above that all actions resulting in a movement of items from source block to sink block satisfy our required property. It is easy to see that this implies neither nor will be the source of another split or merge until it is the target of at least enqueue or remove operations. By that point, all items in (or ) and (or ) will be up to date, so it is safe to overwrite (or ).
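The deferred-update mechanism described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the names `Block`, `move_items`, `repair_some`, `locate`, and `touch` are hypothetical, a single Python dictionary `index` stands in for the pointer records kept in the two hash tables, and a set of stale items plays the role of the bit array. Only the limit of 12 repaired items per block per call is taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Optional, Set

REPAIR_LIMIT = 12  # items repaired per block per operation, as in the text


@dataclass(eq=False)
class Block:
    name: str
    items: Set[Hashable] = field(default_factory=set)
    stale: Set[Hashable] = field(default_factory=set)  # stands in for the bit array
    forward: Optional["Block"] = None  # block that most recently received our items


def move_items(src: Block, dst: Block, moved: Set[Hashable]) -> None:
    """Move items immediately, but leave the index pointing at src for now."""
    src.items -= moved
    src.stale -= moved
    dst.items |= moved
    dst.stale |= moved   # index entries for these items are not yet up to date
    src.forward = dst    # overwrite any older forwarding pointer


def repair_some(block: Optional[Block], index: Dict[Hashable, Block]) -> None:
    """Bring at most REPAIR_LIMIT stale index entries for this block up to date."""
    if block is None:
        return
    for item in list(block.stale)[:REPAIR_LIMIT]:
        index[item] = block
        block.stale.discard(item)


def locate(item: Hashable, index: Dict[Hashable, Block]) -> Optional[Block]:
    """Follow the index pointer, then at most one forwarding pointer."""
    block = index.get(item)
    if block is None:
        return None
    if item in block.items:
        return block
    nxt = block.forward
    return nxt if nxt is not None and item in nxt.items else None


def touch(item: Hashable, index: Dict[Hashable, Block]) -> Optional[Block]:
    """Stand-in for the bookkeeping done by an enqueue or remove on item's block."""
    block = locate(item, index)
    if block is not None:
        repair_some(block, index)
        repair_some(block.forward, index)
    return block
```

In this model, moving 30 items leaves 30 stale index entries, and three subsequent calls to `touch` on the receiving block bring all of them up to date, which mirrors how the cost of a move is spread over later operations.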
We obtain the following theorem.
Theorem 2.
We can implement a locationaware multiqueue so that the remove() and isMember() operations each use I/Os, and the enqueue() operation uses expected I/Os, where is the expected number of I/Os needed to perform an insertion in an externalmemory cuckoo table of size .
It should be clear from our description that, except for trivial cases (such as having only a constant number of elements), the space requirements of our multiqueue implementation are within a constant factor of the optimal. We have not attempted to optimize this factor, though there is some flexibility in the multiqueue operations (such as when to do a split) that would allow some optimization. We study these tradeoffs in Section 5.
4 Combining Cuckoo Hashing and LocationAware Multiqueues
In this section, we describe how to construct an efficient external-memory multimap implementation by combining the data structures described above. The result is a multimap built around a cuckoo hash table in external memory that supports insertions in a constant expected number of I/Os, together with optimal findAll and removeAll operations.
We store an externalmemory cuckoo hash table, as described above, as our primary structure, , with each record pointing to a block in a multiqueue, , having an auxiliary dictionary, , implemented as yet another externalmemory cuckoo hash table. We then perform each of the operations of the multimap ADT as follows.

insert: To insert the keyvalue pair, , we first perform a look up for in . If there is already a record for in , we increment its count. We then follow its pointer to the appropriate block in , (in the deamortized implementation, the queue for may reside in rather than ), and add the pair to , as in the enqueue multiqueue method. Otherwise we insert into with a null header record and count 1 and then add the pair to as in the enqueue multiqueue method.

isMember: This is identical to the isMember multiqueue operation.

remove: To remove the keyvalue pair, , from , we perform a look up for in . If there is no record for in , we return an error condition. Otherwise, we follow this pointer to the appropriate block of holding the pair (in the deamortized implementation, if is not in , we may have to look in as well). We remove the pair from and as in the remove multiqueue method, and decrement its count.

findAll: To return the set of all keyvalue pairs in having key equal to , we perform a look up for in , and follow its pointer to the appropriate block of (in the deamortized implementation, the queue for may reside in rather than ). If this is a light queue, then we just return the items with key equal to . Otherwise, we return the entire block and all the other blocks of this queue as well.

removeAll: We give here a constant amortized time implementation, and explain in Section 4.1 how to deamortize this operation. To remove from all keyvalue pairs having key equal to , we perform a look up for in , and follow its pointer to the appropriate block of (in the deamortized implementation, the queue for may reside in rather than ). If this is a light queue, then we remove from all items with key equal to and remove all affected pointers from ; if this causes to become deficient, we perform a merge action or alteration action as in the remove multiqueue method. If this is a heavy queue, we walk through all blocks of this queue and remove all items from these blocks and return each block to the free list. We also remove all affected pointers from . Finally, we remove the header record for from , which implicitly sets the count of to zero as well. We charge, in an amortized sense, the work for all the I/Os to the insertions that added these keyvalue pairs to in the first place.

count(): Return , which we track explicitly for all keys in .
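To make the intended semantics of these six operations concrete, here is a minimal in-memory reference model. It is not the external-memory structure: a Python dict of sets replaces the primary table, the multiqueue, and the auxiliary dictionary, and the class name `ReferenceMultimap` and its snake_case method names are hypothetical. A model of this kind is mainly useful as an oracle when testing a block-based implementation.

```python
from typing import Dict, Hashable, Set, Tuple


class ReferenceMultimap:
    """In-memory model of the multimap ADT; ignores blocks and I/O entirely."""

    def __init__(self) -> None:
        self._values: Dict[Hashable, Set[Hashable]] = {}

    def insert(self, k: Hashable, v: Hashable) -> None:
        # The text assumes the pair (k, v) is not already present.
        self._values.setdefault(k, set()).add(v)

    def is_member(self, k: Hashable, v: Hashable) -> bool:
        return v in self._values.get(k, ())

    def remove(self, k: Hashable, v: Hashable) -> None:
        vals = self._values.get(k)
        if vals is None or v not in vals:
            raise KeyError((k, v))  # the "error condition" of the ADT
        vals.remove(v)
        if not vals:
            del self._values[k]     # drop the record once its count reaches zero

    def find_all(self, k: Hashable) -> Set[Tuple[Hashable, Hashable]]:
        return {(k, v) for v in self._values.get(k, ())}

    def remove_all(self, k: Hashable) -> None:
        self._values.pop(k, None)

    def count(self, k: Hashable) -> int:
        return len(self._values.get(k, ()))
```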
4.1 Deamortizing removeAll()
The removeAll() operation of Section 4 required amortized I/O operations in the worst case without altering the capacity of our structure. We now describe a deamortized implementation that also requires I/O operations and does not alter the capacity. We perform a look up for in . If no record is found, we are done. Otherwise, we follow its pointer to the header of its queue . We remove all items in from , and set ’s pointer in to null. This completes the operation; notice we do not update any records in at this time. Instead, we explain the modifications necessary to handle the existence of “spurious” pointers in (i.e., pointers for pairs that were deleted in a removeAll() operation) with an increase in the I/O cost of the insert(), remove(), isMember(), and findAll() operations.
First, we describe a function isSpurious() that requires I/O operations and determines whether an entry in is spurious. isSpurious() first performs a look up in for . If no record for exists, we return false. Otherwise, we follow the pointer for to a block in and search for . If a record is found, we return false. Otherwise, we follow the pointer (described in Section 3.3) to a block and search for in . If it is found, we return false; otherwise, we return true.
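The three lookups performed by isSpurious() can be sketched as follows. This is an in-memory illustration, not the paper's implementation: `Block`, `is_spurious`, and the `primary` dictionary are hypothetical stand-ins for a multiqueue block, the function, and the primary hash table. The handling of a null pointer (treated here as "nothing can be found, so the entry is spurious") is an assumption drawn from the removeAll() description rather than something the text states for this function.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Optional, Set, Tuple

Pair = Tuple[Hashable, Hashable]


@dataclass(eq=False)
class Block:
    pairs: Set[Pair] = field(default_factory=set)
    forward: Optional["Block"] = None  # forwarding pointer of Section 3.3


def is_spurious(k: Hashable, v: Hashable,
                primary: Dict[Hashable, Optional[Block]]) -> bool:
    # Step 1: no record for k in the primary table means "not spurious",
    # following the text literally.
    if k not in primary:
        return False
    block = primary[k]
    # Assumption: a null pointer (left behind by removeAll) means the pair
    # cannot be found anywhere, so the entry is reported as spurious.
    if block is None:
        return True
    # Step 2: search the block that the primary table points to.
    if (k, v) in block.pairs:
        return False
    # Step 3: search the single block reachable via the forwarding pointer.
    nxt = block.forward
    if nxt is not None and (k, v) in nxt.pairs:
        return False
    return True
```

Each step touches a constant number of blocks, which is why the check adds only a constant number of I/Os in the external-memory setting.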
We now describe how to modify the insertion method of our externalmemory cuckoo hash table so that the presence of spurious pointers does not decrease the table’s capacity. First, when inserting a keyvalue pair into , we begin by doing a look up in for . If a record for exists, we call isSpurious. If this function returns false, we return an error condition. Otherwise, we remove the record for from before proceeding. This ensures that at all times there is only one entry for each pair in .
Second, we modify the BFSbased insertion procedure of Theorem 1 as follows. For each bucket visited by the BFS, we call isSpurious for all pairs residing in the bucket. If this function returns true for any pair , we delete from and insert the new pair in its place. This ensures that no spurious entry in ever prevents another entry from being inserted, i.e., the spurious entries will have no effect on the capacity of the table. Since the buckets in the cuckoo hashing algorithm of Theorem 1 have constant size, calling isSpurious on a bucket requires just I/O operations.
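The per-bucket step of this modified insertion can be sketched in isolation. The function below is a hypothetical illustration (`place_in_bucket` and its parameters are not names from the paper); it omits the BFS itself and the hash functions, and it assumes that free slots are tried before spurious entries are reclaimed, an ordering the text does not specify.

```python
from typing import Callable, Hashable, List, Tuple

Pair = Tuple[Hashable, Hashable]


def place_in_bucket(bucket: List[Pair], new_pair: Pair, capacity: int,
                    is_spurious: Callable[[Pair], bool]) -> bool:
    """Try to store new_pair in bucket; return True on success."""
    if len(bucket) < capacity:
        bucket.append(new_pair)
        return True
    for i, old in enumerate(bucket):
        if is_spurious(old):
            bucket[i] = new_pair  # delete the spurious entry and reuse its slot
            return True
    return False  # bucket holds only live entries; the search would move on
```

Because a spurious entry is always replaceable, it never occupies a slot that a live pair needs, which is the sense in which spurious entries do not reduce the table's capacity.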
With this in hand, we finally describe how to modify the insert(), remove(), isMember(), and findAll() operations to handle the presence of spurious entries in with only an increase in the I/O complexity of each operation.

insert: Works unmodified.

isMember(): We call the previous implementation of isMember() as well as the function isSpurious(). We return true if and only if the former returns true and the latter returns false.

remove: We perform a look up for in . If none is found, we return an error condition. Otherwise, we call the function isSpurious(). If this returns true, we return an error condition. Otherwise, we call the old implementation of remove.

findAll: Works unmodified.
We finally obtain the following theorem.
Theorem 3.
One can implement the multimap ADT in external memory using blocks of memory with I/O performance as shown in Table 1.
5 Experimental Results
We performed extensive simulations of our algorithm in order to explore how various settings of the design parameters affect I/O complexity and space usage, for both our basic algorithm (Section 3.2) and our deamortized algorithm (Section 3.3).
We simulated a cache of size KB with blocks of size 4 KB. Our simulated cache used the least-recently-used page replacement rule. When reporting the number of I/Os, we count only transfers from disk to cache; each such transfer is preceded by a transfer from cache to disk of the least recently used cache page, and we do not count this transfer in our reported values. We drew keys from a universe of size million, using bytes to store each key, and 8 bytes to store each value. We did not explicitly store queues as doubly-linked lists, but instead laid them out as arrays within their blocks, with a marker representing the end of one queue and the beginning of another; this allowed us to avoid storing expensive pointers for these lists. We used 4 bytes to represent all pointer values in and . We did not charge for storing the counts associated with each key because we do not need to store these counts explicitly except to achieve I/O operations for the count operation (and moreover we can achieve this by only storing explicit counts for heavy queues, as the count of a light queue can be obtained in O(1) I/Os by finding ’s unique block in via a lookup in and then counting how many items contains).
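The I/O accounting described here can be reproduced with a very small simulator. The sketch below is an assumption about how such a simulator might look, not the authors' code: `LRUCacheSim` and `access` are hypothetical names, and the cache capacity is left as a parameter measured in pages. Only page reads are counted; the write-back of the evicted page is deliberately ignored, as in the text.

```python
from collections import OrderedDict


class LRUCacheSim:
    """Page cache with least-recently-used eviction that counts page reads."""

    def __init__(self, capacity_pages: int) -> None:
        self.capacity = capacity_pages
        self.pages: "OrderedDict[int, None]" = OrderedDict()
        self.reads = 0  # disk-to-cache transfers, the only I/Os counted

    def access(self, page_id: int) -> None:
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
            return
        self.reads += 1
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)   # evict the LRU page; write-back not counted
        self.pages[page_id] = None
```

A data structure under test would call `access` once for every block it touches, and the reported I/O cost of an operation is the increase in `reads` over that operation.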
All results presented use random-walk cuckoo hashing with two hash functions and buckets capable of storing 4 KB of data; we found that using the partitioning technique of Theorem 1 to implement cuckoo hashing required slightly more space (and I/O complexity was comparable) because the hash tables had slightly smaller capacity. For our hash tables and , we allotted a space overhead of ; we found this was even more overhead than strictly necessary. We also ran a full set of experiments using three hash functions to implement cuckoo hashing, but found that two hash functions were sufficient due to the large bucket size; we found using two hash functions instead of three saved about 1 I/O per insert and remove operation. To capture realistic frequency distributions, which are often skewed, we generated all keys for insertions from a Zipfian distribution; in a Zipfian distribution with parameter , the frequency of the ’th most frequent item is proportional to . The larger , the more skewed the frequency distribution.
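Key generation from a Zipfian distribution can be sketched as below. This assumes the standard form of the distribution, in which the rank-i key has weight i raised to the power minus alpha; the function names `zipf_cum_weights` and `sample_keys`, the parameter name `alpha`, and the integer key encoding are all hypothetical choices made for illustration.

```python
import itertools
import random
from typing import List


def zipf_cum_weights(n_keys: int, alpha: float) -> List[float]:
    """Cumulative weights, with weight i**(-alpha) for rank i = 1..n_keys."""
    return list(itertools.accumulate(i ** -alpha for i in range(1, n_keys + 1)))


def sample_keys(n_keys: int, alpha: float, n_samples: int, seed: int = 0) -> List[int]:
    """Draw keys 0..n_keys-1, where key 0 is the most frequent."""
    rng = random.Random(seed)
    cum = zipf_cum_weights(n_keys, alpha)
    return rng.choices(range(n_keys), cum_weights=cum, k=n_samples)
```

Precomputing the cumulative weights once keeps each draw logarithmic in the universe size, which matters when the universe holds millions of keys.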
Our goal was to identify the steady-state behavior of our data structure. In all experiments, we performed a sequence of 1 million insertions, followed by a sequence of 8 million alternating insert() and remove() operations. For each remove operation, the pair for removal was selected uniformly at random from the table.
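A workload of this shape can be driven by a short harness such as the following sketch. The function `run_workload` and its parameters are hypothetical, the alternating phase is interpreted as a total operation count (an assumption, since the text could also be read as pairs of operations), and each insert is given a fresh value so that no pair is inserted twice.

```python
import random
from typing import Callable, Hashable, List, Tuple

Pair = Tuple[Hashable, Hashable]


def run_workload(insert: Callable[[Hashable, Hashable], None],
                 remove: Callable[[Hashable, Hashable], None],
                 next_key: Callable[[], Hashable],
                 n_warmup: int, n_alternating: int, seed: int = 0) -> int:
    """Warm up with inserts, then alternate insert/remove; return final size."""
    rng = random.Random(seed)
    live: List[Pair] = []  # every pair currently stored, for uniform sampling
    next_value = 0         # a fresh value per insert keeps all pairs distinct

    def do_insert() -> None:
        nonlocal next_value
        pair = (next_key(), next_value)
        next_value += 1
        insert(*pair)
        live.append(pair)

    def do_remove() -> None:
        i = rng.randrange(len(live))
        live[i], live[-1] = live[-1], live[i]  # swap-remove: O(1) uniform choice
        remove(*live.pop())

    for _ in range(n_warmup):
        do_insert()
    for step in range(n_alternating):
        if step % 2 == 0:
            do_insert()
        else:
            do_remove()
    return len(live)
```

Keeping the live pairs in a flat list lets the harness pick a uniformly random stored pair without querying the structure under test, so the sampling itself contributes no I/Os to the measurements.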
5.1 Basic Implementation
Space usage results from the basic algorithm are shown in Figures 3(a) and 3(b). The vertical line represents the point at which we completed million insertions and began alternating insertions and deletions. denotes the Zipfian parameter, denotes the lighttoheavy queue transition threshold, and is the deficiency parameter (i.e. blocks of size less than are declared deficient). Notice we experiment with more aggressive settings of and for .
Across all parameter settings, we achieved steadystate loads of between and , where we defined the load to be , where is the number of bytes used by pages not on the free list in our algorithm, and is the minimum number of bytes required to explicitly store all keyvalue pairs in the structure. Notice with KB blocks, bytes corresponds to just over 3,000 blocks of memory.
I/O statistics from representative settings of parameters are in Table 2. A smaller results in improved space usage as a smaller implies that we perform merge actions more aggressively. Similarly, a smaller implies we are more reluctant to tie up entire blocks devoted to a single heavy queue, and thus yields improved space usage.
In the basic algorithm, the average cost over all insert and remove operations is extremely low: about I/Os per operation. However, as depicted in Table 2 the cost distribution is bimodal – the vast majority (over 99.9%, except for very close to 2) of operations require about 4 I/Os, but a small fraction of operations require several hundred. The maximum number of I/Os ranges between and .
These highcost operations are due to split and merge actions. The deamortized implementation displays substantially different behavior, with no operation requiring more than a few dozen I/Os (see Section 5.2). Notice we tested parameter values for which the theoretical bounds on I/O complexity do not hold; for example, with , a merge may immediately follow a split.



5.2 Deamortized Implementation
Figures 4(a) and 4(b) present space usage results for the deamortized implementation, following the same protocol as the amortized experiments (Section 5.1). We achieved loads of about to for basic parameter values ( and ). We also experimented with very high settings of , where we trade off increased space usage for improved I/O complexity.
More interesting is the I/O complexity of the deamortized implementation, shown in Table 3. We see that, in stark contrast to the bimodal cost distribution of the basic implementation, the deamortized implementation never requires more than a few dozen I/Os for any given operation. Moreover, even the average I/O complexity of the deamortized implementation is significantly better than that of the basic implementation, with an improvement of at least I/Os per operation, for parameters where we have a direct comparison. We attribute much of this improvement to the modified split rule, which makes light-to-heavy queue transitions significantly less expensive. Note that the maximum number of I/Os for any operation in our deamortized experiments, across all parameters, is at most 43, an order of magnitude below the maximum for our basic algorithm.
In Table 3, we also display the breakdown in I/O complexity between inserts and remove operations. We see that removes are about twice as expensive as inserts; this is not unexpected. An insert requires a look up in , followed by loading the header page for the queue , an insert into , and then possibly a split. Due to the skewness of our input data, these first two steps can be free, as these pages are often already in the cache. A remove requires a look up in , followed by loading the appropriate page in , and then possibly a merge. In contrast to inserts, the first two steps are rarely free.





We note that we have tested our performance against a working commercial database product, which places key-value pairs in a hash table, but moves key-value pairs associated with a key to a B-tree when the number of values associated with a key becomes large. Preliminary tests suggest this approach yields a slightly smaller average number of memory accesses per operation, but in turn runs with significantly more memory. We expect that both implementations could be optimized significantly, and each approach might be preferable in different circumstances.
6 Conclusion
We have described an efficient external-memory implementation of the multimap ADT, which generalizes the inverted file data structure that is useful for supporting search engines. Our methods are based on new expected-time bounds for performing updates in block-based cuckoo hash tables as well as an external-memory multiqueue data structure. In addition to proving theoretical bounds on the I/O complexity of our implementation, we demonstrated experimentally that our data structure is able to trade off constant factors in space against the time to perform operations in well-understood ways.
One direction for future work is to consider efficient in-memory algorithms for multimaps, an area that seems not to have been given significant attention. Another natural direction would be to derive improved high-probability bounds for block-based cuckoo hash tables. In particular, improved analysis of random-walk cuckoo hashing in this setting is worthwhile. These are natural extensions of open problems in the theory of cuckoo hashing.
Acknowledgments
We thank Margo Seltzer for several helpful discussions.
References
 [1] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Commun. ACM, 31:1116–1127, 1988.
 [2] D. A. Alcantara, A. Sharf, F. Abbasinejad, S. Sengupta, M. Mitzenmacher, J. D. Owens, and N. Amenta. Real-time parallel hashing on the GPU. ACM Trans. Graph., 28:154:1–154:9, 2009.
 [3] D. K. Blandford and G. E. Blelloch. Compact dictionaries for variable-length keys and data with applications. ACM Transactions on Algorithms, 4(2), 2008.
 [4] S. Büttcher and C. L. A. Clarke. Indexing time vs. query time: tradeoffs in dynamic information retrieval systems. In Proc. of 14th ACM Conf. on Information and Knowledge Management (CIKM), pages 317–318. ACM, 2005.
 [5] S. Büttcher, C. L. A. Clarke, and B. Lushman. Hybrid index maintenance for growing text collections. In Proc. of 29th ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR), pages 356–363. ACM, 2006.
 [6] D. Cutting and J. Pedersen. Optimization for dynamic inverted index maintenance. In 13th ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR ’90, pages 405–411. ACM, 1990.
 [7] S. Dehaene. The neural basis of the Weber-Fechner law: a logarithmic mental number line. Trends in Cognitive Sciences, 7(4):145–147, 2003.
 [8] M. Dietzfelbinger and C. Weidling. Balanced allocation and dictionaries with tightly packed constant size bins. Theoretical Computer Science, 380:47–68, 2007.
 [9] R. Guo, X. Cheng, H. Xu, and B. Wang. Efficient online index maintenance for dynamic text collections by using dynamic balancing tree. In Proc. of 16th ACM Conf. on Information and Knowledge Management (CIKM), CIKM ’07, pages 751–760. ACM, 2007.
 [10] S. Hecht. The visual discrimination of intensity and the Weber-Fechner law. Journal of General Physiology, 7(2):235–267, 1924.
 [11] D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, MA, 1973.
 [12] N. Lester, A. Moffat, and J. Zobel. Efficient online index construction for text databases. ACM Trans. Database Syst., 33:19:1–19:33, September 2008.
 [13] N. Lester, J. Zobel, and H. Williams. Efficient online index maintenance for contiguous inverted lists. Inf. Processing & Management, 42(4):916–933, 2006.
 [14] R. W. Luk and W. Lam. Efficient in-memory extensible inverted file. Information Systems, 32(5):733–754, 2007.
 [15] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, New York, NY, USA, 2005.
 [16] M. Mitzenmacher and S. Vadhan. Why simple hash functions work: exploiting the entropy in a data stream. In Proc. of the 19th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 746–755, 2008.
 [17] D. R. Musser and A. Saini. The STL Tutorial and Reference Guide: C++ Programming with the Standard Template Library. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1995.
 [18] R. Pagh and F. Rodler. Cuckoo hashing. Journal of Algorithms, 52:122–144, 2004.
 [19] R. Panigrahy. Efficient hashing with lookups in two memory accesses. In Proc. of the 16th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 830–839, 2005.
 [20] R. Panigrahy. Hashing, Searching, Sketching. Ph.D. thesis, Dept. of Computer Science, Stanford University, 2006.
 [21] E. Verbin and Q. Zhang. The limits of buffering: a tight lower bound for dynamic membership in the external memory model. In STOC, pages 447–456, 2010.
 [22] J. S. Vitter. External sorting and permuting. In M.-Y. Kao, editor, Encyclopedia of Algorithms. Springer, 2008.
 [23] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38, July 2006.