Hash in a Flash: Hash Tables for Solid State Devices
In recent years, information retrieval algorithms have taken center stage for extracting important data from ever larger datasets. During computation, these datasets are stored both in memory and on disk. As datasets grow, demand for superior storage solutions will rise. Advances in hardware technology have led to the increasingly widespread use of flash storage devices. Such devices have clear benefits over traditional hard drives in terms of access latency, bandwidth, and random access capabilities, particularly when reading data. Thus traditional information retrieval algorithms, such as TF-IDF, can greatly benefit. There are, however, some interesting trade-offs to consider when leveraging the advanced features of such devices. On a relative scale, writing to such devices can be expensive. This is because typical flash devices (NAND technology) are updated in blocks. A minor update to a given block requires the entire block to be erased and then rewritten. On the other hand, sequential writes can be two orders of magnitude faster than random writes. In addition, random writes degrade the life of the flash drive, since each block can support only a limited number of erasures. TF-IDF can be implemented using a counting hash table. In general, hash tables are a particularly challenging case for the flash drive because this data structure is inherently dependent upon the randomness of the hash function, as opposed to the spatial locality of the data. This makes it difficult to avoid the random writes incurred during the construction of the counting hash table for TF-IDF. In this paper, we study the design landscape for the development of a hash table for flash storage devices. We demonstrate how to effectively design a hash table with two related hash functions, one of which exhibits a data placement property with respect to the other.
Specifically, we focus on three designs based on this general philosophy and evaluate the trade-offs among them along the axes of query performance, insert and update times and I/O time through an implementation of the TF-IDF algorithm.
In recent years, advances in hardware technology have led to the development of flash devices. These devices have several advantages such as faster seek times because of a lack of moving parts. Furthermore, they are more energy-efficient because of their use of non-mechanical techniques for data storage [3, 6]. Such drives are extremely fast for random read operations since they do not require the mechanical seeks necessary for disks.
While data access is extremely fast, writes to the drive can vary in speed depending upon the scenario. Sequential writes are quite fast, though random writes can be over two orders of magnitude slower. The reason for this is the coarse granularity at which data is erased and updated on flash devices. In fact, depending on the management of the Flash Translation Layer (FTL), random updates in a flash drive may be significantly slower than random writes on disk, in spite of the fact that the flash drive does not have any moving parts. On the other hand, random reads are extremely fast on the flash drive and are almost as fast as sequential reads. A second property of the flash drive is that it supports only a finite number of erase-write cycles, after which the blocks on the drive may wear out.
The different trade-offs in read-write speed lead to a number of challenges for database applications, especially those in which there are frequent updates to the underlying data. As a result, there has been a considerable amount of research [2, 3, 6, 17, 22, 23, 25] on database operations in flash storage devices. In particular, index structures are a challenge for the flash drive because they are frequently updated with individual records. Such updates can be expensive unless they are carefully batched with the use of specialized update techniques. The idea is to minimize the block overhead of random writes. This approach also reduces the number of erase-write cycles on the flash, which increases its effective life.
The hash table is a widely used data structure in modern database systems. A hash table relies on a hash function to map keys to their associated values. In a well-designed table, insertion and lookup require constant (amortized) time, independent of the number of entries in the table. Such hash tables are commonly used for lookup, duplicate detection, searching and indexing in a wide range of domains including database indexing. A counting hash table is one in which, in addition to the value associated with a key, a (reference) count is also kept up to date in order to keep track of the occurrences of a specific key-value pair. Counting hash tables are also widely used. In the programming languages and software engineering context, such tables can be used for object reference counting to aid in garbage collection and memory leak detection activities (e.g. Java JVM [12, 24]). In the data mining context, such tables are often used to efficiently count the number of occurrences of a given pattern (e.g. frequent pattern mining). In the database context, such tables are used for indexing (e.g. XML indexing, and selectivity estimation). In the computational linguistics and information retrieval context, such tables can be used to efficiently count the number of distinct words and the number of occurrences per word within a corpus or document collection.
TF-IDF, Term Frequency-Inverse Document Frequency, is a common technique used in text mining and information retrieval. TF-IDF measures the importance of a particular word to a document given a corpus of documents. Words that appear frequently, often referred to as stop-words, are given a lower TF-IDF score than rarer words, as the latter are assumed to offer more information about a document's subject. For query processing, such as in a search engine, documents can be ranked by their relevancy using TF-IDF; the relevancy of a document increases if a word contained in a query has a high TF-IDF score. TF-IDF can also be used for document similarity. A set of keywords can be defined for each document; keywords are those words with a TF-IDF score higher than a set threshold. Applying a similarity measure to the resulting TF-IDF vectors of two documents yields a similarity score between them. Computing the TF-IDF scores requires accumulating the occurrences of a term; this is an excellent application for counting hash tables.
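As a rough illustration of why counting hash tables fit this workload, the following sketch accumulates term and document frequencies in in-memory counting tables and then derives the scores. It assumes the common weighting tf × log(N / df), which is only one of several TF-IDF variants and is not specified by the text above; the function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: score} dict per document.

    Assumes the common variant tf * log(N / df); other weightings exist.
    """
    n_docs = len(corpus)
    # Counting hash table: number of documents containing each term.
    doc_freq = Counter()
    # One counting hash table per document: occurrences of each term.
    term_freqs = []
    for doc in corpus:
        counts = Counter(doc.lower().split())
        term_freqs.append(counts)
        doc_freq.update(counts.keys())

    scores = []
    for counts in term_freqs:
        total = sum(counts.values())
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy corpus: "the" occurs in every document.
corpus = ["the flash drive", "the hash table", "the flash hash"]
scores = tf_idf(corpus)
```

In this toy corpus a word that appears in every document receives a score of zero, while a word confined to one document scores highest. Every call to `Counter` here is an in-place count update, which is exactly the operation that becomes expensive once the table no longer fits in memory.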
Hash tables are an enormous challenge for the flash drive because they are naturally based on random hash functions and exhibit poor access locality. Thus, the key property of the hash table, randomness, becomes a liability on the flash drive. This paper will provide an effective method for updates to the tables in flash memory, by using a carefully designed scheme which uses two closely related hash functions in order to ensure locality of the update operations. Specifically, we will be designing a counting hash table. Counting hash tables pose an additional challenge since, unlike standard hash tables, a duplicate key-value pair requires an in-place update to the specific location. In-place updates are non-trivial and given their unpredictable nature, they place an additional burden beyond just insertions and lookups.
This paper is organized as follows. The remainder of this section will discuss the properties of the flash drive which are relevant to the effective design of the hash table. We will then discuss related work and the contributions of this paper. In Section 2, we will discuss our different techniques. Section 3 contains the experimental results. The conclusions and summary are contained in Section 4.
1.1 Properties of the Flash Drive
The solid state drive (SSD) is implemented with flash memory, which comes in two types: NOR and NAND chips. NOR chips are faster and have a longer lifetime, but NAND chips have higher capacity and have been adopted in most commodity mobile devices using SSDs.
The most basic unit of access is a page which contains between 512 and 4096 bytes, depending upon the manufacturer. Furthermore, pages are organized into blocks each of which may contain between 32 and 128 pages. The data is read and written at the level of a page, with the additional constraint that when any portion of data is overwritten at a given block, the entire block must be copied to memory, erased on the flash, and then copied back to the flash after modification. This process is performed automatically by the software known as the Flash Translation Layer (FTL) on the flash drive. Thus, even a small random update of a single byte could lead to an erase-write of the entire block. Similarly, an erase, or clean, can only be performed at the block level rather than the byte level. Since random writes will eventually require erases once the flash device is full, it implies that such writes will require block level operations. Thus, the overhead for the case of random writes can be very large, unless one is careful about the techniques used for data modification. On the other hand, sequential writes on the flash are quite fast; typically sequential writes are two orders of magnitude faster than random writes.
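The cost of a small in-place update can be made concrete with a little arithmetic. The sketch below is a deliberately simplified model in which any overwrite forces an erase-write of every block it touches; it ignores the remapping that a real FTL performs. The function name and the default sizes (taken from the upper ends of the ranges quoted above) are illustrative assumptions.

```python
def bytes_rewritten(update_bytes, page_bytes=4096, pages_per_block=128):
    """Bytes the device rewrites for one in-place update.

    Simplified model: any overwrite forces an erase-write of the whole block,
    and the update is assumed to start at a block boundary.
    """
    block_bytes = page_bytes * pages_per_block
    blocks_touched = -(-update_bytes // block_bytes)  # ceiling division
    return blocks_touched * block_bytes

# A 1-byte update with 4096-byte pages and 128 pages per block.
one_byte_cost = bytes_rewritten(1)
# The same update with 512-byte pages and 32 pages per block.
small_block_cost = bytes_rewritten(1, page_bytes=512, pages_per_block=32)
```

Under this model a single-byte update rewrites an entire block, which is why batching many updates into one block-level write pays off.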
Another technological limitation of the flash drive is that it can typically support only a limited number of erase-writes. After this, the blocks on the flash may degrade and may not be able to support further writes. Typically, a flash drive can support between 10,000 and 100,000 erase-writes. In this respect, random writes are extremely degrading to the flash because they may trigger many erase-writes. Therefore, it is essential to batch as many updates as possible on blocks. This is particularly difficult for the hash table because it often contains a large fraction of cells which are not updated, and the writes are inherently random in nature.
1.2 Related Work
Flash devices have recently attracted increasing interest in the database community because of their fast random read and sequential write performance. Flash devices have been used in enterprise database applications, as a write cache to improve latency, and as an intermediate structure to improve the efficiency of migrate operations in the storage layer. Methods for page-differential logging, which store data efficiently on flash devices in a DBMS-independent way, have also been proposed. Other database applications such as the design of dynamic self-tuning databases and the maintenance of database samples have been discussed in [28, 29]. There has also been work on designing tree indexes on raw flash devices [17, 35] and indexes to deal with the random-write problem.
Rosenblum and Ousterhout proposed the notion of log-structured disk storage management, which relies on the assumption that reads are cheap (as they are served from memory) and writes are expensive (due to disk seeks and rotations). Not surprisingly, mechanisms similar to log-structured file systems are adopted in modern SSDs, either at the level of the FTL or at the level of the file system, to handle issues related to wear-leveling and erase-before-write [9, 15, 18, 19, 34]. As we discuss later, some of our buffering strategies are also inspired by log-structured file systems. Our design exploits the strength of flash-based storage devices in fast sequential writes, and tries to alleviate the problem of random writes.
Hash tables have been designed for SSDs in several contexts, including data-intensive networked systems, wimpy nodes, data de-duplication, energy-efficient sensor memory, and persistent storage used as write and/or read caches. A tree index for flash devices has also been designed; however, it does not address duplicate keys and thus cannot handle a counting hash.
In this work, we design a counting hash table that maintains frequencies, which has not been addressed thus far. We store the hash table on the SSD, which is not the case in the designs of [5, 11]. Unlike most existing strategies, which rely on simple memory-based buffering schemes, we design a novel combination of memory- and disk-based buffering. Our method leverages the strengths of SSDs (fast sequential/random reads, fast sequential writes) to effectively address their weaknesses (random writes, write endurance). We would like to emphasize that the works presented in this section do not handle a counting hash table, which is required by algorithms like TF-IDF, but our proposed hash table designs do.
1.3 Contributions of this paper
In this paper, we will design a hash table for the SSD. The hash table is a particularly challenging case compared to conventional index structures because it is inherently dependent upon randomness as opposed to spatial locality. Index structures, which are dependent upon spatial locality, are much easier to update because the spatial locality can be leveraged to perform block updates of particular regions of the index. This is non-trivial for the hash table, in which the randomness guarantees that successive updates may occur at completely unrelated places in the table. As a result, it is much more difficult to cluster updates for the purpose of block updates. In this work, we make the following specific contributions:
– We propose a mechanism to support large counting hash tables on SSDs via a two-level hash function, which ensures that the random update property of flash devices is effectively handled by localizing the updates to the SSD.
– We devise a novel combination of memory- and disk-based buffering that effectively addresses the problems posed by SSDs (random writes, write endurance). While the memory-resident buffer leverages the fast random accesses to RAM, the disk-resident buffer exploits the fast read performance and fast sequential/semi-random write performance of SSDs.
– We perform a detailed empirical evaluation to illustrate the effectiveness of our approach by demonstrating the TF-IDF algorithm using our hash table.
2 A Flash-Friendly Hash Table
In this section, we will introduce a hash table which is optimized for flash storage devices. We will introduce a number of different schemes for implementing the hash-table as well as basic hash table operations on these designs.
The major property of a hash table is that its effectiveness is highly dependent upon updates which are distributed randomly across the table. On the other hand, in the context of a flash-device, it is precisely this randomness which causes random access to different blocks of the SSD. Furthermore, updates which are distributed randomly over the hash table are extremely degrading in terms of the wear properties of the underlying disk. This makes hashing particularly challenging for the case of flash devices.
Hash table addressing is of two types: open and closed, depending upon how the data is organized and collisions are resolved. These two kinds of tables are as follows.
Open Hash Table: In an open hash table, each slot of the hash table corresponds to multiple data entries. Thus, each slot is a container of entries which map onto that value of the hash function. Each entry of the collection is a key and frequency pair.
Closed Hash Table: In a closed hash table, the entries are accommodated within the hash table itself. Thus, the hash table slot contains the hashed string and its frequency. However, since multiple objects cannot be mapped onto the same entry, we need a collision resolution process when a hashed object maps onto an entry which has already been filled. In such a case, a common strategy is to use linear probing, in which we cycle through successive entries of the hash table until we either find an instance of the object itself (and increase its frequency), or we find an empty slot in which we insert the new entry. We note that a fraction of the hash table (typically at least a quarter) needs to be empty in order to ensure that the probing process is not a bottleneck. The fraction of the hash table which is full is denoted by the load factor $f$. It can be shown that the expected number of entries accessed in order to read or update an entry depends on $f$ and grows rapidly as $f$ approaches 1.
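The linear probing procedure described above can be sketched as a small in-memory counting table. This is an illustrative sketch rather than the paper's implementation: the class name `ClosedCountingTable`, the use of Python's built-in `hash`, and the wrap-around at the end of the table are all assumptions made for the example.

```python
class ClosedCountingTable:
    """Counting hash table with linear probing (illustrative sketch)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity  # each slot: None or [key, frequency]
        self.used = 0

    def add(self, key, delta=1):
        """Add delta to the count of key; return the number of slots probed."""
        capacity = len(self.slots)
        start = hash(key) % capacity
        for probe in range(capacity):
            idx = (start + probe) % capacity
            entry = self.slots[idx]
            if entry is None:            # empty slot: insert a new entry
                self.slots[idx] = [key, delta]
                self.used += 1
                return probe + 1
            if entry[0] == key:          # duplicate key: in-place update
                entry[1] += delta
                return probe + 1
        raise RuntimeError("table is full")

    def count(self, key):
        """Return the stored frequency of key, or 0 if it is absent."""
        capacity = len(self.slots)
        start = hash(key) % capacity
        for probe in range(capacity):
            entry = self.slots[(start + probe) % capacity]
            if entry is None:            # an empty slot ends the probe sequence
                return 0
            if entry[0] == key:
                return entry[1]
        return 0

    def load_factor(self):
        return self.used / len(self.slots)

table = ClosedCountingTable(8)
for word in ["flash", "hash", "flash"]:
    table.add(word)
```

Note that the `count` method stops at the first empty slot. This relies on colliding entries being stored contiguously, which is the property that deletions can later violate.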
In this paper, we will use a combination of the open and closed hash tables in order to design our update structure. We will use a closed hash table as the primary hash table, which is stored on the solid state drive, along with an update (secondary) hash table which is open and held in main memory. The hash functions of the two tables are different (since the number of entries in the secondary hash table is much lower than in the primary), but they are related to one another in a careful way, so as to guarantee locality of updates. We will discuss this slightly later. The secondary hash table is updated for each incoming record; from time to time, portions of the secondary table are used in order to update the primary table in batch mode. The batch updates are scheduled so as to minimize the wasteful erase-writes in the update process.
We assume that the primary hash table contains $q$ entries, where $q$ is dictated by the maximum capacity planned for the hash table for the application at hand. The secondary hash table contains $\lceil q/r \rceil$ entries, where $r \ll q$. The hash functions for the primary and secondary hash tables are denoted by $g(x)$ and $s(x)$, respectively.
In general, the scheme will work with any pair of hash functions $g(x)$ and $s(x)$ which satisfy the following relationship:
$$s(x) = \lfloor g(x) / r \rfloor$$
It is easy to see that the entries which are pointed to by a single slot of the memory-resident table are located approximately contiguously on the drive-resident (closed) table, because of the way in which the linear probing process works. This is an important observation, and will be used at several places in ensuring the efficiency of the approach. Linear probing essentially assumes that items that collide onto the same hash function value will be contiguously located in a hash table with no empty slots between them. Specifically, the $i$th slot of the secondary table corresponds to entries $r \cdot i$ up to $r \cdot i + r - 1$ in the primary table. We note that most entries which would be pointed to by the $i$th slot of the secondary table would also map onto the aforementioned entries in the primary table, though this would not always be true because of the overflow behavior of the linear probing process beyond these boundaries.
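A minimal sketch of one such pair of hash functions follows, assuming that the secondary hash is simply the primary hash integer-divided by a fixed group size. The names `primary_hash`, `secondary_hash`, `primary_range`, the constants `Q` and `R`, and the use of CRC32 as a stand-in base hash are illustrative assumptions and not the paper's definitions.

```python
import zlib

Q = 1024   # hypothetical number of entries in the primary (drive-resident) table
R = 64     # hypothetical number of primary entries covered by one secondary slot

def primary_hash(key: str) -> int:
    """Home position of key in the primary table (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % Q

def secondary_hash(key: str) -> int:
    """Slot of key in the memory-resident table, derived from the primary hash."""
    return primary_hash(key) // R

def primary_range(slot: int) -> range:
    """Primary-table positions whose secondary slot is `slot`."""
    return range(slot * R, slot * R + R)

# Every key's home position falls inside the range owned by its secondary slot.
for word in ["flash", "hash", "table", "ssd"]:
    assert primary_hash(word) in primary_range(secondary_hash(word))
```

Because all keys buffered in one secondary slot share a contiguous range of home positions, flushing that slot touches one region of the primary table rather than positions scattered across the drive. Entries displaced past the end of the range by linear probing are the exception.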
2.1 Desirable Update Properties of an SSD-based Hash-Table
In this subsection, we will provide a broad overview of some of the desirable update properties of a hash table. In later subsections, we will discuss how these goals are achieved. A naive implementation of a hash table will immediately issue update requests to the hash table as the data points are received. The vast majority of the write operations will be random page-level writes due to the lack of locality, which is inherent in hash function design. As mentioned before, such random writes also incur costly cleans.
A desirable property for a hash table would be block-level updates and semi-random writes. The block-level update refers to the case when there are multiple updates written to a block, and they are all accomplished at one time. If there are $u$ updates written to a block, we should combine them into one block-level write operation. This can reduce the number of cleans from $u$ to one. The semi-random writes refer to the fact that the updates to a particular block are in the same order as they are arranged on the block, even though updates to different blocks may be interleaved with one another.
Semi-random writes can be illustrated as follows. Consider a sequence of (block id, page id) pairs in which updates to several blocks are interleaved. Such a sequence is considered semi-random if the page updates to any particular block are arranged sequentially by their order of page id. Recall that sequential write patterns improve latency. This is because of how the flash translation layer works. The existing methods in the flash translation layer are typically lazy; when the $i$th logical page of a block is written, the FTL copies and writes only the first $i$ pages to a newly allocated block, instead of all the pages. Later, when a page $j > i$ is modified, only pages $i+1$ to $j$ are moved and written to the new block. Note that this would not be possible if $j$ were less than $i$. Semi-random writes therefore improve the write latency of an SSD, and the sequential ordering is useful in minimizing the unnecessary copying of pages from old blocks to new blocks.
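The definition can be expressed as a short check over a write trace. The sketch below assumes that "sequential" means strictly increasing page ids within each block; the function name `is_semi_random` and both example traces are hypothetical and chosen only to illustrate the property.

```python
def is_semi_random(writes):
    """True if, within every block, page ids appear in strictly increasing order.

    `writes` is an iterable of (block_id, page_id) pairs in issue order.
    """
    last_page = {}
    for block_id, page_id in writes:
        if block_id in last_page and page_id <= last_page[block_id]:
            return False
        last_page[block_id] = page_id
    return True

# Hypothetical traces: blocks are interleaved in both, but only the first
# keeps each block's pages in ascending order.
interleaved = [(1, 1), (1, 2), (2, 1), (3, 1), (1, 3), (3, 2), (2, 2)]
out_of_order = [(1, 2), (2, 1), (1, 1)]
```

The first trace passes the check even though consecutive writes jump between blocks, while the second fails because block 1 receives page 1 after page 2.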
2.2 Hash table designs
A variety of low-level structures can be maintained in order to accomplish the desirable properties discussed earlier. We will design a number of such hash table maintenance schemes. All of these schemes use a combination of these low-level structures in different ways. However, we would like to introduce these low-level structures at this stage in order to ease further discussion. Recall that we combine an open hash table in main memory with a closed hash table on the SSD. This open (or secondary) hash table is typically implemented in the form of a RAM buffer. The RAM buffer will contain updates for each block of the SSD and execute batch updates to the primary hash table on the drive, referred to as the data segment, at the block level. This approach can reduce block-level cleaning operations.
2.3 Memory Bounded Buffering
The overall structure common to the hash table architectures presented in this paper is illustrated in Figure 1. We refer to this scheme as Memory Bounded Buffering or MB. The RAM buffer in the diagram is an open (or secondary) hash table and the data segment is a closed (or primary) hash table. There are $m$ slots, each of which corresponds to a block in the data segment. The maximum capacity of the data segment is $N$ pages, with $B$ pages per block and $E$ entries per page. Thus, the number of slots in the secondary hash table, $m$, must be equal to $N/B$. Updates are flushed onto the SSD one block at a time. Because of the relationship between the hash functions of the primary and secondary tables, the merge process of a given list requires access to only a particular set of SSD blocks, which can be maintained in main memory during the merging process. This may of course involve the insertion of new items that are not present in the data segment and of items that collide with entries already inside the data segment.
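The following sketch simulates the MB idea in memory: updates accumulate in one RAM slot per data segment block and are merged block by block once the buffer exceeds a limit. It makes several simplifying assumptions that are not taken from the paper: each SSD block is modeled as a Python `Counter` rather than a closed hash table with probing, all non-empty slots are flushed at once, and the class name, `ram_limit` parameter, and `block_writes` counter are hypothetical.

```python
from collections import Counter

class MemoryBoundedTable:
    """Sketch of MB: per-block update lists in RAM, merged to the drive in batches."""

    def __init__(self, num_blocks, entries_per_block, ram_limit):
        self.num_blocks = num_blocks
        self.entries_per_block = entries_per_block
        self.ram_limit = ram_limit                           # max distinct buffered keys
        self.ram = [Counter() for _ in range(num_blocks)]    # one slot per block
        self.data = [Counter() for _ in range(num_blocks)]   # stand-in for SSD blocks
        self.block_writes = 0

    def _slot(self, key):
        # Secondary hash = primary hash divided by the entries per block.
        total_entries = self.num_blocks * self.entries_per_block
        return (hash(key) % total_entries) // self.entries_per_block

    def add(self, key):
        self.ram[self._slot(key)][key] += 1
        if sum(len(slot) for slot in self.ram) > self.ram_limit:
            self.merge_all()

    def merge_all(self):
        for block_id, pending in enumerate(self.ram):
            if pending:
                self.data[block_id].update(pending)  # one block-level write
                self.block_writes += 1
                pending.clear()

    def count(self, key):
        slot = self._slot(key)
        return self.ram[slot][key] + self.data[slot][key]
```

In this model, repeated updates to the same key cost no drive writes until a merge, and a merge costs one block write per touched block regardless of how many updates that block has accumulated.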
2.4 Memory and Disk Bounded Buffering
Since the RAM buffer is main-memory resident, it is typically restricted in size. Therefore, a second buffer can be implemented on the SSD itself. This new segment is referred to as the change segment. The change segment acts as a second-level buffer. When the RAM buffer exceeds its size limitations, its contents are sequentially written to the change segment at the page level, starting from the first available page, in an operation known as staging. When the change segment is full, it is merged with the data segment, and staging restarts from the top of the change segment. A page in the change segment may contain updates from multiple blocks because pages are packed with as many entries as fit on a page, irrespective of their slot origin. Thus, the change segment is organized as a log structure that contains the flushed updates of the RAM buffer. This takes advantage of the semi-sequential write performance of the SSD and increases the lifetime of the SSD. The space allocated to the change segment is in addition to the space allocated to the data segment. This hash table (with the change segment included) is illustrated in Figure 2. It is important to note that a stage() operation differs from a merge() operation in two ways. First, stages write at the page level while merges operate at the block level. Second, stages involve updates to the change segment while merges involve updates to the data segment.
There are two types of architectures for the change segment. In the first design, the change segment is viewed as a collection of blocks, where each block holds updates from multiple lists of the RAM buffer. In other words, multiple blocks in the data segment are mapped to a single block in the change segment. We arrange the change segment such that each change segment block holds the updates for $k$ data segment blocks. The value of $k$ is constant for a particular instantiation of the hash table, and can be determined in an application-specific way. For an update-intensive application, it is advisable to set $k$ to a smaller value at the expense of SSD space.
When a particular change segment block is full, we merge the information in the change segment to the data segment blocks. The advantage of this method is best demonstrated if the RAM buffer is small. In that case, it will cause frequent merges onto the data segment under the MB design of the hash table. By adding the change segment, we are providing a more efficient buffering mechanism. Staging a segment is more efficient than merging it because the change segment is written onto the SSD with a straightforward sequential write, which is known to be efficient for SSD. This approach is called Memory Disk Buffering or MDB.
In the second design, a variation of the MDB scheme which we henceforth refer to as MDB-L (for MDB-Linear), the space allocated for the change segment is viewed as a single large monolithic chunk of memory without any subdivisions. This view resembles a large log file. Thus, the change segment blocks are not assigned to data segment blocks. The writes to the change segment are executed in FCFS fashion. This type of structure mimics a log-structured file system and takes full advantage of the SSD's strength in sequential writes. We maintain a collection of pointers to identify the ranges, measured in pages, to which a particular slot in the RAM buffer has been staged. These pointers are similar to the indexing information maintained in log-structured file systems, which helps in reading files from the log efficiently.
A merge operation is triggered when the change segment is full. The collection of pointers can be used to identify the pages to which a particular block was staged. This process produces random reads on the change segment because the ranges span multiple stage points. The reads are also repetitive: during a stage, entries from multiple blocks may be packed into a single page, so during a merge, each page will be requested by every data segment block that has entries staged onto it. After all of the pages for a particular data segment block are read from the change segment, the entries are merged with the corresponding data segment block.
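The staging and merging behavior of the log-style change segment can be sketched as follows. This is an in-memory model under stated assumptions: pages are Python lists, the pointer collection is a mapping from slot to the set of page indices holding its entries, and the decision of when the segment is full is left to the caller. The class and attribute names are hypothetical.

```python
from collections import Counter, defaultdict

class LinearChangeSegment:
    """Sketch of an MDB-L style log: staged entries packed into pages in arrival order."""

    def __init__(self, entries_per_page):
        self.entries_per_page = entries_per_page
        self.pages = [[]]                    # log of pages; entries are (slot, key, freq)
        self.slot_pages = defaultdict(set)   # slot -> indices of pages holding its entries
        self.page_reads = 0

    def stage(self, slot, pending):
        """Append one RAM-buffer slot's updates to the tail of the log."""
        for key, freq in pending.items():
            if len(self.pages[-1]) == self.entries_per_page:
                self.pages.append([])        # start the next sequential page
            self.pages[-1].append((slot, key, freq))
            self.slot_pages[slot].add(len(self.pages) - 1)

    def merge(self):
        """Return {slot: Counter of staged updates}, then reset the log."""
        merged = {}
        for slot, page_ids in self.slot_pages.items():
            updates = Counter()
            for page_id in sorted(page_ids):
                self.page_reads += 1         # a shared page is re-read once per slot
                for entry_slot, key, freq in self.pages[page_id]:
                    if entry_slot == slot:
                        updates[key] += freq
            merged[slot] = updates
        self.pages = [[]]
        self.slot_pages.clear()
        return merged
```

The `page_reads` counter makes the repetitive reads visible: a page that holds entries from two slots is read twice during a merge, once on behalf of each data segment block.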
2.5 Element Insertion and Update Process
The element insertion process is designed to perform individual updates on the memory-resident table (the RAM buffer) only, since this can be done in an efficient way. Such changes are later rolled onto the drive-resident table, via the change segment for some of the schemes.
For each incoming record, we first apply the hash function of the RAM buffer in order to determine the slot to which the corresponding entry belongs. We then determine whether the key is present inside the corresponding slot. If the element is found, then we increase its frequency. The second case is when the key is not contained inside the buffer pointed to by the slot. In such a case, we add the key as a new element to the RAM buffer. The size of the RAM buffer increases in this case. If the RAM buffer has grown too large, it is flushed either onto the change segment or directly onto the data segment, depending upon whether or not the change segment is implemented in the corresponding scheme. Because of the relationship between the hash functions of the RAM buffer and the SSD-based hash table, such an update process tends to preserve the locality of updates and, if desired, can also be made to preserve semi-random write properties.
During the insertion of new items, linear probing may occur because the data segment is a closed hash table. If the linear probing process reaches the end of the current SSD block, then we do not move the probe onto the next block. Rather, an overflow region is allocated within the SSD table which takes care of probing overflows beyond block boundaries. The last index of the last page of an SSD block becomes a pointer to the overflow region. The entry that was resident at this position now resides in the overflow region alongside the newly inserted entry. Thus, the data segment is a collection of blocks with logical extensions. The overflow region, a collection of SSD blocks, is allocated when the hash table is created. Its size can vary depending on user specifications. The blocks are written one at a time and the pages are assigned when needed. When an overflow region is needed, a page is assigned. If another region is needed, another page is allocated and the previous page points to the newly allocated page. When a block is full, another block is used. An overflow block may contain entries from multiple blocks in the SSD data segment because the regions are allocated a page at a time when they are needed.
2.6 Implementing Deletion Operations
It is also possible to implement deletion operations in the hash table. There are two kinds of deletion possible:
A deletion operation reduces the frequency of an item by 1. This case is trivially solvable by using the approach for insertion, except that we use a frequency of -1 for the incoming element. Entries with frequency value of 0 are not retained in the memory-resident table, and entries with negative frequencies are allowed. These negative frequencies are appropriately transferred to the drive-resident table during batch updates.
A deletion operation completely removes an element from the hash table irrespective of its frequency. This case is more complex, and is discussed below.
If an item is deleted from the data segment, it can either be removed or its frequency can be set to zero. If it is removed, there will be added complexity to queries, updates, and inserts. This is because of the way in which the linear probing process is implemented. During a query, if an empty slot is encountered during the probing process, the query terminates (with the guarantee that the item is not found anywhere else in the hash table) because of this contiguity assumption. However, the removal of entries in a deletion process can violate this contiguity assumption. This is because the empty slot encountered during linear probing may be the result of a previously filled slot, which was removed by a deletion process. In such a case, the desired entry may reside beyond the empty slot, and it may no longer be possible to terminate after the first instance of an empty slot. This could potentially invalidate the correctness of the query processing.
This problem can be handled during the merge of a block or periodically. In both cases, the block is loaded into main memory. The entries are hashed inside a main memory block, but the deleted items are ignored during this process. This will ensure that the entries are contiguous. The newly hashed entries are then re-written directly to the data segment block.
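The rehashing step can be sketched as a function that rebuilds one block in main memory while skipping deleted entries. The sketch assumes that deleted entries are marked by a frequency of zero and that probing wraps around within the block rather than spilling into an overflow region; the function name `rehash_block` and the `home` callback are hypothetical.

```python
def rehash_block(block, home):
    """Rebuild one block so that live entries are contiguous under linear probing.

    `block` is a list of slots: None (empty) or (key, freq), where freq == 0
    marks a deleted entry. `home(key)` gives the key's home index in the block.
    """
    fresh = [None] * len(block)
    for entry in block:
        if entry is None or entry[1] == 0:   # skip empty and deleted slots
            continue
        idx = home(entry[0])
        while fresh[idx] is not None:        # linear probing inside the block
            idx = (idx + 1) % len(fresh)
        fresh[idx] = entry
    return fresh

# Hypothetical block of 8 slots: key 1 was deleted, key 9 had been displaced by it.
old_block = [None, (1, 0), (9, 2), None, None, None, None, None]
new_block = rehash_block(old_block, home=lambda key: key % 8)
```

After rehashing, the displaced key moves back to its home position, so a query that stops at the first empty slot is correct again.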
2.7 Query Processing
In the simple hash table, queries are fulfilled by an I/O request to the data segment. However, in our proposed designs the corresponding entry may be found either in the change segment or the RAM buffer. Therefore, the query processing approach must search the change segment and the RAM buffer in addition to the data segment. Thus, the frequency of a queried item is the total frequency found in the change segment, RAM buffer, and data segment. The search of the RAM buffer may be inexpensive because it is in main memory. On the other hand, access to the change segment requires access to the SSD.
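A minimal sketch of this summation follows, assuming that the RAM buffer and data segment can be modeled as dictionaries and the change segment as a list of pages of (key, frequency) entries. The function name `query_frequency` and these data layouts are illustrative assumptions; in the actual designs each lookup would involve hashing and, for the drive-resident parts, page reads.

```python
def query_frequency(key, ram_buffer, change_segment_pages, data_segment):
    """Sum the counts of `key` across the three places an update may reside."""
    total = ram_buffer.get(key, 0)
    for page in change_segment_pages:
        # A key may have been staged several times, so every match is added.
        total += sum(freq for staged_key, freq in page if staged_key == key)
    total += data_segment.get(key, 0)
    return total

# Hypothetical state: "flash" was merged once, staged twice, and buffered once.
ram = {"flash": 2}
staged = [[("flash", 1), ("hash", 5)], [("flash", 3)]]
merged = {"flash": 10, "hash": 4}
flash_total = query_frequency("flash", ram, staged, merged)
```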
For the case of the data segment, the query processing approach is quite similar to that of standard hash tables. A hash function is applied to the queried entry in order to determine its page-level location inside the data segment. If the entry is found at that location, its frequency is returned. Otherwise, linear probing begins, because the disk hash table is a closed hash. Linear probing halts when the entry is found or an empty slot is encountered. Query processing of the change segment also requires locating the entry, which may reside in multiple places because the RAM buffer is flushed repeatedly.
Recall that MDB partitions the change segment. When a RAM bucket is staged, it is always written to the same change segment block. We locate the appropriate change segment block and bring it into memory to be searched. In MDB-L, RAM buckets can reside on multiple pages, and thus we must issue random page reads. We expect MDB-L to be faster because of page level access.
In this section we present an empirical analysis of the hash table designs discussed in the previous sections. We evaluate the performance of the three main schemes discussed in this article, namely MB, MDB and MDB-L. Broadly, our objectives are to understand the I/O overheads of various schemes and their query performance. Additionally, since SSD disks permit a limited number of clean operations, it is also important to quantify the wear rate of the devices. We begin with a discussion of the experimental setup.
3.1 Experimental Setup
To evaluate our hash table configurations, we used the DiskSim simulation environment, managed by Carnegie Mellon University, and the SSD extension for this environment created by Microsoft Research. We operated DiskSim in slave mode, which allows programmers to incorporate DiskSim into another program for increased timing accuracy.
We ran our experiments on three different configurations of the latest representative NAND flash SSDs from Intel (see Table 1 for details). Two of these SSDs are MLC (Multi-Level Cell) based and the third is SLC (Single-Level Cell) based. We chose both MLC and SLC devices because of their differing characteristics. While MLC provides much higher data density and lower cost (which makes it more popular), it has a shorter lifespan and slower read/write performance. SLC, on the other hand, has faster read/write performance and a significantly longer lifespan. SLC also has a lower internal error rate, making it preferable for high-performance, high-reliability devices.
All hash table experiments involve inserting, deleting, and updating key-value pairs. The size of the RAM buffer is parameterized on the size of the data segment and expressed as a percentage. The rationale here is that an end application may need to create multiple hash tables on the same SSD. Moreover, the characteristics of access may vary across applications (i.e., one may want different RAM buffer sizes for each hash table). The change segment is likewise parameterized, and the overflow segment for all experiments was set to a minimal value (one block) since this was found to be sufficient. Key-value pairs are integer pairs.
We conducted our experiments on a DELL Precision T1500 with an Intel® Core™ i7 CPU, 8 GB of memory, and 8 cores, running Ubuntu 10.04. Our code was implemented in C++. The data structure used the C++ Standard Template Library for its implementation. The RAM buffer buckets that correspond to data segment blocks are arranged inside a C++ vector, and their indexes correspond to their placement on the data segment. For example, the first block inside the data segment corresponds to the first bucket in the RAM buffer. The data segment can be viewed as an array logically divided into blocks and further divided into pages.
To demonstrate the efficacy of our methods, we implemented the TF-IDF algorithm, see Section 1, using our hash table designs. In our experiments, we compute the TF-IDF score of all words in our corpus. Our hash table contains the frequencies of each keyword. As we read in each document, we compute the frequency of each word and store it in our counting hash table.
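The counting step can be sketched with ordinary in-memory hash tables. This is only an illustration: the `Corpus` type and `tokenize` helper are hypothetical, `std::unordered_map` stands in for the flash-resident counting hash table, and the weighting used here, tf * log(N / df), is one common textbook variant that may differ from the definition given in Section 1.

```cpp
#include <cmath>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Counter = std::unordered_map<std::string, std::size_t>;

// Split text on whitespace (no normalization, for brevity).
inline std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    for (std::string w; in >> w;) tokens.push_back(w);
    return tokens;
}

struct Corpus {
    std::vector<Counter> term_counts;  // per-document word frequencies
    Counter doc_freq;                  // number of documents containing a word

    // Count every word of a document as it is read in.
    void add_document(const std::string& text) {
        Counter tf;
        for (const auto& w : tokenize(text)) ++tf[w];
        for (const auto& kv : tf) ++doc_freq[kv.first];
        term_counts.push_back(std::move(tf));
    }

    // Assumed weighting: tf * log(N / df); zero for words not in the document.
    double tf_idf(std::size_t doc, const std::string& term) const {
        const Counter& tf = term_counts.at(doc);
        auto t = tf.find(term);
        auto d = doc_freq.find(term);
        if (t == tf.end() || d == doc_freq.end()) return 0.0;
        const double n_docs = static_cast<double>(term_counts.size());
        const double idf = std::log(n_docs / static_cast<double>(d->second));
        return static_cast<double>(t->second) * idf;
    }
};
```

Every `++tf[w]` and `++doc_freq[...]` in this sketch is an update to a counting hash table, which is the operation that becomes expensive once the table lives on flash.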
|Page Size (KB)||4||4||4|
|Sustained||Up to 170||Up to 250||Up to 250|
|Sustained||Up to 35||Up to 70||Up to 170|
3.3 Data Sets
We use two datasets: a Wikipedia dump and a MemeTracker dump, both of which are essentially large text files.
Wiki: The first data set we use is a collection of randomly collected Wikipedia articles. We chose random Wikipedia articles collected from Wikipedia’s publicly available
To evaluate I/O performance during inserts or updates
To evaluate query performance we first processed 35 million tokens. Subsequently, roughly 100 million words were inserted. Concurrently with these inserts, we issued a million queries interleaved randomly across the inserts. A query is a hash table lookup. In the TF-IDF context, this corresponds to asking “how frequent is a keyword”, which allows us to compute the TF-IDF score of a keyword. Of these queried items, on average (spread across 10 different random workloads) of them were present inside the hash table at query time.
Meme: The second data set we report is the
I/O performance is evaluated in the same way as for the Wiki dataset. For query performance, the first 130 million words were inserted into the hash table. Subsequently, the remaining 270 million words were interleaved with about one million queries. Of these queried items, (spread across 10 random workloads) of them were found inside the hash table.
3.4 Query Time Performance
In all our graphs, the Y-axis is the average time per query in milliseconds. Results on both Wiki and Meme are provided in Figure 6. The main trends we observe are: i) the query time for MB is quite low (it does not have a change segment); ii) the query time for MDB is quite high and does not drop significantly with a reduction in change segment size; iii) the query time for MDB-L improves dramatically with a reduction in the change segment size; and iv) query times for Meme are marginally lower than the query times for Wiki for both MDB and MDB-L. These trends can be explained as follows. Query costs for MB are essentially fixed, since a query only has to combine the counts from the memory buffer (negligible) with the requisite information from the data segment (typically a single page read). Query costs for MDB require consolidation of information from the memory buffer (negligible), from the change segment (expensive, dominated by block-level reads), and from the data segment (usually a single page read). Query costs for MDB-L require consolidation across the memory buffer (negligible), the change segment (typically requiring a few page reads, which are significantly reduced as the size of the change segment is reduced), and the data segment (usually a single page read). This is reflected in our first experiment, shown in Figure 6 for MLC-1, in which we varied the change segment while fixing the RAM buffer to 5%. With regard to the difference between Wiki and Meme query times, upon drilling down into the data, we find that on average there are 11.5% more page reads for Wiki. This may be an artifact of the linear probing costs within both datasets, given that the ratio of the number of unique tokens to hash table size is slightly higher for Wiki.
For the second experiment, again on the MLC-1 configuration, we fixed the change segment to 12.5% and varied the RAM buffer for both datasets (see Figure 6). We observe that with an increase in RAM buffer size: i) MB shows a negligible change in average query time; ii) MDB shows a decrease in average query time; iii) MDB-L shows a significant decrease in average query time; and iv) query times on Meme are typically faster than those on Wiki.
To explain these trends we should first note that increasing the RAM buffer size has the general effect of reducing the number of stage operations, and thus the average amount of useful information within the change segment. Thus the time it takes to consolidate the information within the change segment in order to answer a query is, on average, lower for both MDB and MDB-L. For MDB-L the improvement is more marked because fewer page reads are required. The explanation for why query times are lower for Meme is similar to what we observed for the previous experiment.
The third experiment on query time performance evaluated the three SSD configurations on the Wiki dataset, shown in Figure 6. Here the RAM buffer was set to 5% and the change segment was set to 12.5%. The results are along expected lines in that average query times are slightly better on MLC-2 and SLC than on MLC-1 for the MDB method. The superior read performance for both page-level and block-level operations is the primary reason. This difference is marked in the case of MDB, but for both MDB-L and MB the difference is negligible. MDB requires a block-level read for a query, and the performance difference for this type of operation is more pronounced for MLC-2 and SLC when compared with MLC-1.
To conclude, we should reiterate that the query times we observe here are for our update-intensive query workload, where we interleaved queries with inserts (averaged over multiple runs). In this environment the query time performance of MB is always the fastest. For reasonable parametric settings MDB-L typically approaches MB in performance, while MDB is always an order of magnitude worse. We should note that we also evaluated query times for all three methods in more stable settings (few updates/inserts). In such a stable setting we found that the query times for all three methods were identical. The query cost essentially boils down to a page read or two on the data segment (since the change segment is empty and does not factor in). Furthermore, MDB is bounded by a single block read while the query time of MDB-L may vary. However, as our results indicate, the pointer-guided page-level accesses of MDB-L provide efficient read access that outperforms MDB.
3.5 I/O Performance
In this section we examine the I/O performance of the three strategies. To exclude the impact of queries, our workloads for both datasets simply insert all the tokens or words into their respective hash tables. In our first experiment, we report the overall I/O cost from the perspective of the SSD for the three SSD configurations on both Wiki and Meme (see Figure 9). The RAM buffer is set to 5% and the change segment is set to 12.5% in this experiment.
The main trend we observe is that both MDB-L and MDB require comparable yet significantly lower I/O costs than MB. This is primarily attributable to the presence of the change segment, which enables sequential writes (MDB-L) or semi-random writes (MDB). Additionally, as we shall see shortly, MB requires a large number of erasures, which also contribute to the overall I/O cost. Another trend we observe is that among the SSD configurations, SLC and MLC-2 offer comparable performance with a slight edge to SLC, while MLC-1 is quite a bit slower. This is primarily attributable to the superior write bandwidth of SLC and MLC-2. Finally, we observe that the overall I/O times are higher for Meme than for Wiki (larger dataset and larger hash table).
Not shown in our reports are the performance measures for a hash table that does not use a buffer. The advantage of this scheme is fast query times, because queries are only page-level reads on the data segment. However, results show that such a hash table would induce 1,680,323 cleans for the Wiki dataset and 6,669,932 cleans for the Meme dataset. Its I/O performance is on the order of 615 times slower on the Wiki dataset and 1500 times slower on the Meme dataset than the times reported in Figure 9. This increase is caused by cleaning time and random page writes. It is clear that there is a benefit from our designs.
Next we discuss the breakdown of page and block level operations, merge and stage operations.
To better understand the I/O performance, we drill down a bit further on some of the core operations within the I/O subsystem for all three methods. Table 2 shows the block and page level operations, number of merges, and number of stages for each method for varying RAM buffer percentages and varying change segment sizes on the Wikipedia dataset. For each method, “Block” represents the number of block operations and “Page” represents the number of page level operations. The percentage shown with the Block value represents the ratio of block level operations to the total number of (Block + Page) operations. Columns “Merges” and “Stages” list the number of merges and stages for a method, respectively. Note that MB has no page level operations and does not leverage staging.
Before we discuss the main trends we should highlight that the number of merges for each of the methods is quite different. It should be noted that a merge operation is not exactly the same for each of the methods. Recall that a merge for MB essentially entails unifying the contents of the memory buffer with the data segment. In fact, the number of merge operations for MB is identical to the number of stage operations for the other two methods. Staging in both of these methods involves unifying the contents of the memory buffer with the change segment via sequential writes or semi-random writes. A merge operation for MDB-L entails unifying the entire contents of the change segment with the data segment. A merge operation for MDB entails merging the contents of one block within the change segment (that has filled up) with the contents of the data segment. This explains the difference in the number of merges for these methods.
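The relationship between MB merges and MDB-L stages can be made concrete with a deliberately simplified counting model. The assumptions are ours, not the paper's: every insert occupies a new RAM buffer entry (no repeated keys), capacities are measured in entries rather than blocks, a flush happens exactly when the RAM buffer fills, and MDB-L merges the whole change segment just before a stage that would not fit. The names `OpCounts`, `simulate_mb`, and `simulate_mdbl` are hypothetical, and MDB's per-block merges are not modeled.

```cpp
#include <cassert>
#include <cstddef>

struct OpCounts {
    std::size_t stages = 0;
    std::size_t merges = 0;
};

// MB: every full RAM buffer is merged directly into the data segment.
inline OpCounts simulate_mb(std::size_t inserts, std::size_t ram_capacity) {
    assert(ram_capacity > 0);
    OpCounts c;
    c.merges = inserts / ram_capacity;
    return c;
}

// MDB-L: every full RAM buffer is staged to the change segment; the whole
// change segment is merged into the data segment when a stage would not fit.
inline OpCounts simulate_mdbl(std::size_t inserts, std::size_t ram_capacity,
                              std::size_t change_capacity) {
    assert(ram_capacity > 0 && change_capacity >= ram_capacity);
    OpCounts c;
    std::size_t buffered = 0;  // entries currently in the RAM buffer
    std::size_t staged = 0;    // entries currently in the change segment
    for (std::size_t i = 0; i < inserts; ++i) {
        if (++buffered == ram_capacity) {
            if (staged + buffered > change_capacity) {
                ++c.merges;  // empty the change segment first
                staged = 0;
            }
            staged += buffered;
            buffered = 0;
            ++c.stages;
        }
    }
    return c;
}
```

Under this model, 1000 inserts with a 10-entry RAM buffer give 100 merges for MB and 100 stages for MDB-L, while a 50-entry change segment reduces the MDB-L merge count to 19. The point is only the qualitative relationship; the absolute numbers in Table 2 depend on details this model ignores.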
The main trends we observe from this table are noted below.
- The number of basic I/O operations (Block, Page) and meta-level I/O operations (Merges, Stages), and therefore the total I/O cost, drops as we increase the RAM buffer size, because a larger buffer reduces the number of stage operations.
- The ratio of block level operations to page level operations increases both with increasing RAM buffer size and with decreasing change segment size for both MDB and MDB-L. The number of stages decreases faster than the number of merges as the RAM buffer is increased. If the change segment size is reduced, the number of merges increases because the change segment fills more frequently.
- MDB-L typically has the lowest number of block level operations, while MB requires the largest number of block operations (which are significantly more expensive than page operations). The low number of block operations for MDB-L can be attributed to the linear change segment.
- In terms of overall I/O costs, MDB and MDB-L have a similar profile while MB is significantly more expensive.
Summing up the I/O performance, it is fair to say that for most reasonable parameter settings MDB and MDB-L significantly outperform MB in terms of the cost of I/O from the perspective of the flash device. Additionally, it should be noted that the merging operation within both MDB and MDB-L happens completely within the SSD (allowing for an overlap of CPU operations, which is not reflected in any of the experiments), whereas a merge for MB and staging for the other two methods require some CPU intervention. Also note that an MB merge operation is significantly more expensive than an MDB or MDB-L stage operation (random writes versus semi-random/sequential writes).
In our next experiment we take a closer look at the number of clean operations required by these methods for both datasets (see Figure 13). The X-axis of our graphs shows the RAM buffer size for both datasets and the change segment size for the Wiki dataset. The Y-axis is the number of erasures. The main trends and their explanations are: i) the number of cleans goes down with increasing RAM buffer sizes, since there are fewer stages and merges, as shown in Figure 13; ii) the number of cleans is significantly higher for MB than for the other two methods, because the change segment provides an extra level of buffering for MDB and MDB-L, as shown in Figure 13; iii) the number of cleans increases for MDB and MDB-L as we decrease the size of the change segment, because the change segment fills more often and thus there are more merges (MB does not use the change segment, so its count stays constant); and iv) the number of cleans for MDB and MDB-L are very similar, with MDB-L being slightly better. The reduction for MDB-L is clearly attributable to the linear change segment design.
Hash tables pose a challenge for flash devices. Updating an entry inside a disk hash table may trigger the erasure of an entire SSD block, and repeatedly updating a hash table can be detrimental to the limited lifetime of the underlying SSD. A simple hash table without buffering can be implemented. It has superior query time, but it induces a substantial number of cleans and a high I/O cost. From our experiments, we believe that an SSD-friendly hash table will have a RAM buffer and a disk-based buffer that supports semi-random writes. These features increase the locality of updates and reduce the I/O cost of the hash table for both low-end and high-end SSDs. Overall, our results reveal that when one accounts for both I/O performance and query performance, MDB-L seems to offer the best of both worlds on the workloads we evaluated, for reasonable parameter settings of change segment size and RAM buffer size. Furthermore, MLC-2 currently seems to offer the best trade-off when taking into account both cost and performance of the device. In the future, we would also like to extend our design to hash functions that do not rely on the mod operator (e.g., extendible hashing) and examine various checkpointing methods.
- See http://dumps.wikimedia.org/
- As noted earlier deletes are handled as inserts with a negative count.
- See http://memetracker.org/
- MLC-1 is on the order of 30 and 50 times more expensive for block level reads and block level writes, MLC-2 is over 25 and 35, and SLC is over 24 and 28 respectively.