Mithril: Mining Sporadic Associations for Cache Prefetching


Abstract

The growing pressure on cloud application scalability has accentuated storage performance as a critical bottleneck. Although cache replacement algorithms have been extensively studied, cache prefetching – reducing latency by retrieving items before they are actually requested – remains an underexplored area. Existing approaches to history-based prefetching, in particular, provide too little benefit to real systems for the resources they consume.

We propose Mithril, a prefetching layer that efficiently exploits historical patterns in cache request associations. Mithril is inspired by sporadic association rule mining and relies only on the timestamps of requests. Through an evaluation of 135 block-storage traces, we show that Mithril is effective, giving an average hit ratio increase of 55% over LRU and Probability Graph, and a 36% hit ratio gain over Amp, at reasonable cost. We further show that Mithril can supplement any cache replacement algorithm and be readily integrated into existing systems. Finally, we demonstrate that the improvement comes from Mithril's ability to capture mid-frequency blocks.

1 Introduction

As cloud tenants use increasing volumes of data, the pressure mounts on the underlying storage systems to prevent high access latencies for end-users. The prevalent techniques for mitigating block storage access latencies are to cache recently accessed blocks [26], and to prefetch blocks into the cache in advance of anticipated accesses [29, 14].

Current approaches to cache prefetching can be divided into two schools. On the one hand, sequential prefetching techniques (such as Amp [7]) anticipate access to consecutive block identifiers, but rely on block I/O with progressive data layout. On the other hand, history-based prefetching seeks to find and exploit deep correlations among past accesses, but normally at substantial computational cost [18]. To mitigate overhead and make caching and prefetching more effective, several applications choose to provide additional hints [23] with each access [4, 9, 27, 18, 19]. Passing extra information, however, requires restructuring, reorganizing or modifying the software stack [23], and is infeasible in scenarios where parts of the stack are proprietary.

We argue that to avoid becoming a latency bottleneck, modern block storage systems need general prefetching techniques that fulfill the following criteria.

  • Exploit history. Various lower layers of storage systems perform sequential prefetching, so the focus should be on more sophisticated spatial and temporal patterns of reuse.

  • Low overhead. The methods must be simple, on-line and impose low time and space overhead.

  • Backward compatible. The methods should implement standard legacy interfaces and treat other parts of the storage system as a black-box.

Existing approaches fall short of one or more of these goals: probability graphs and variants incur intensive space or computation overhead [10, 18, 29]; QuickMine is an online algorithm but relies on hints passed from the system or applications through modified interfaces [23].

In this paper, we propose Mithril, a lightweight online history-based prefetching layer that meets all of these goals. Mithril can be coupled with any existing caching layer, and even composed with a sequential prefetching layer such as Amp [7]. Mithril harnesses several concepts from sporadic association rule mining [16] in the data mining literature. The central idea behind Mithril is to track temporal associations between only those blocks whose access patterns are moderately frequent. Intuitively, items that are accessed regularly are already handled by an underlying caching system, such as LRU, whereas items that are rarely accessed need not occupy precious cache memory. Mithril detects associated access patterns between pairs of blocks without relying on application-level hints. In contrast to other history-based prefetching algorithms [18, 19, 10], Mithril is able to discover relationships between interleaved requests that are not consecutive – a ubiquitous scenario in modern multi-tenant storage systems – without incurring high computation overhead. The focus of this paper is on exploiting patterns in block I/O workloads, but evidence shows that Mithril works on proxy workloads as well. We evaluated Mithril through experiments on traces from a commercial I/O caching analytics service, CloudPhysics [26], as well as file system traces from Microsoft Research (MSR) [22]. We found that Mithril boosts the cache hit ratio by up to 7× over a typical cache strategy (LRU), and improves over the state-of-the-art sequential prefetching algorithm Amp by 36% on average.

Our paper makes three contributions.

  • A design of a history-based prefetching layer Mithril that leverages a novel, low-overhead algorithm to mine for regularity in request timestamps in an optimized manner.

  • A trace-driven experimental evaluation of Mithril on 135 traces from real storage systems, showing that our Mithril layer effectively discovers block associations for prefetching. On average, Mithril increased hit ratio by 56% over LRU, and 36% over Amp. We also measured the latency of Mithril on a real system.

  • A demonstration that Mithril discovers associations between separated blocks from interleaved applications, and that the power of Mithril comes from capturing mid-frequency blocks.

2 Background and Motivation

Algorithm       | Time overhead | Space overhead | Avg. hit ratio improvement | Max. hit ratio improvement | Online | Backward compatible | General
Amp [6]         | Low           | Low            | 12.2%                      | 139%                       | ✓      | ✓                   |
PG [10]         | Low           | High           | 4.1%                       | 156%                       | ✓      | ✓                   | ✓
C-Miner [18]    | High          | Moderate       | N/A                        | N/A                        |        | ✓                   | ✓
QuickMine [23]  | Moderate      | Moderate       | N/A                        | N/A                        | ✓      |                     | ✓
Mithril         | Moderate      | Moderate       | 54.3%                      | 740%                       | ✓      | ✓                   | ✓

Table 1: Comparison of common prefetching approaches. Overhead and improvement are measured over LRU on 135 traces (see Sec. 5). Backward-compatible algorithms require no hints or changes to legacy interfaces. General approaches generalize beyond block I/O traces.

Caching has been widely studied over the past 70 years. The standard algorithm of evicting the least-recently-used elements (LRU) has seen some structural improvements over the years [15, 30, 21, 24]. A complementary approach is to prefetch data into the cache before it is used, typically either based on sequential or history-based patterns [23, 29]. We argue there is room for improvement for prefetching on block I/O workloads.

Sequential prefetching is exploited at lower layers.

In sequential prefetching, the storage server exploits spatial locality in the I/O request stream by retrieving a batch of consecutive blocks upon detecting a sequential access pattern [6, 17]. Fixed-size sequential prefetching is well understood, simple to implement and has seen long deployment, but in workloads where the length of sequential runs varies, it can cause cache pollution and hurt accuracy.

Cloud environments, however, exhibit high levels of concurrency. This results in I/O workloads where multiple applications interleave I/O accesses that break the continuity of consecutive access patterns [23]. Adaptive algorithms such as Amp (Adaptive Multi-stream Prefetching) [6, 7] and TaP (Table-based Prefetching) [17] dynamically decide when and how much to prefetch. Amp, for instance, dynamically adjusts the number of pages to be prefetched to prevent both cache pollution and prefetch wastage when the request stream is interleaved. Amp increases its prefetch degree if prefetched blocks are waited on by the system, and decreases it if prefetched blocks are evicted before use. Unlike other prefetching algorithms, which use the read cache to detect sequential streams, TaP uses a separate table to detect sequentiality and track a longer history. Thus, TaP outperforms Amp on interleaved workloads and at small cache sizes.

Sequential prefetching has been widely deployed and is commonly used in operating systems [20, 2], databases [25] and storage controllers [8]. The ubiquity and success of the approach at lower layers, however, makes it less attractive for higher layers in the storage hierarchy, such as the virtualization layer. The length of contiguous I/O sequences, furthermore, tends to be short at the lowest levels of the storage hierarchy [29] due to virtualization, multi-tenancy, disk encryption and sophisticated file system layouts. Together, these trends reduce the effectiveness of sequential prefetching on today's storage system workloads.

History-based prefetching has been expensive. History-based prefetching, in contrast, tolerates discontinuity across repeating patterns at the cost of added complexity and overhead [14, 10]. One approach is to generate a directed probability graph over data items, where an arc denotes that one item is likely accessed before the other, and each arc is weighted by the probability of such an access [10, 1, 11, 29]. Many systems following this direction prevent graph metadata from becoming unwieldy by operating at the file level instead of the block level [10, 1, 11], which has inherent limitations [14].

Another take on history-based prefetching is to leverage data mining techniques to identify repeating sequences. By mapping each block to an item and applying frequent sequence mining to the request sequence, we can obtain frequent subsequences in an access stream. A frequent subsequence implies that the involved blocks are frequently accessed together. In other words, frequent subsequences are good indicators of block correlations in a storage system. C-Miner [18] and QuickMine [23] employ this technique to discover block correlations in storage systems. However, precise data mining techniques come with high overhead. C-Miner only runs offline due to its cost. QuickMine improves on this by tagging each application I/O block request with a context identifier corresponding to the higher-level application context (e.g., a web interaction, database transaction, etc.). The tag enables the request sequence to be split before mining, making the computation overhead manageable. The key novelty of QuickMine lies in detecting and leveraging block correlations within logical application contexts. Nevertheless, it depends on explicit contextual hints from applications, which makes it hard to deploy and impractical for legacy systems.

Current history-based prefetching approaches may capture complex access patterns, but require either explicit contextual information from applications or suffer from high runtime overheads.

Temporal block associations should be exploited. Block associations are common in storage systems [18]. Sequential prefetching aims to exploit spatially associated blocks, yet temporal associations are equally important for prefetching. Lacking a fast history-based approach, our goal in this paper is thus to efficiently find temporally associated blocks. Table 1 compares the main algorithms.

3 Data Mining Techniques

In search of an approach to efficiently gather history for cache requests to improve prefetching, we survey relevant problems from the data mining literature before describing our approach.

3.1 Sporadic Association Rule Mining

Frequent itemset mining aims to discover which items co-occur frequently in a transaction database. In this field, a group of items is called an itemset, and the number of transactions containing an itemset is called its support. We say an itemset is frequent if its support is larger than or equal to some threshold, the minimum support $m$.

Association rule mining is the discovery of a relationship $x \rightarrow y$ between items $x$ and $y$ in a frequent itemset discovered in the previous step. We say $x \rightarrow y$ if the probability of $y$ appearing given $x$ is above a threshold.

Sporadic association rule mining focuses on associations composed of mid-frequency items. It usually consists of three steps. In the first step, frequent itemsets are generated as before. The second step filters out highly frequent itemsets, defined as those appearing more than the maximum support $M$ times; the frequent itemsets left are called sporadic frequent itemsets. In the third step, association rule mining generates association rules from the sporadic frequent itemsets. By construction, only mid-frequency itemsets and association rules are discovered during the process [13].
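To make these definitions concrete, the following toy sketch brute-forces sporadic pair mining over a tiny, made-up transaction database; the thresholds are hypothetical, and real algorithms such as Apriori-Inverse avoid enumerating all pairs.

from collections import Counter
from itertools import combinations

# Toy illustration (not Apriori-Inverse itself): count pair support, then
# keep only pairs whose support lies in [min_sup, max_sup], i.e., the
# sporadic (mid-frequency) pairs.
transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"c", "d"},
    {"a", "b", "d"}, {"c", "e"}, {"a", "c"},
]
min_sup, max_sup = 2, 3  # hypothetical thresholds

support = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        support[pair] += 1

sporadic = {p: s for p, s in support.items() if min_sup <= s <= max_sup}
print(sporadic)  # {('a', 'b'): 3, ('a', 'c'): 2}: mid-frequency pairs only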

3.2 Generalizing to Block Associations

Let $S$ be a sequence of cache block I/O requests. In order to conduct effective prefetching, we need to identify pairs of requests that are likely to co-occur, but not so frequently that they are already captured by the underlying cache. Notice the similarity to sporadic association rule mining: both try to find related items that appear close by and have mid-range frequency.

To discover such associations, the basic idea is to apply an existing sporadic association rule mining algorithm [16]. However, there are several challenges. A typical storage system can serve up to billions of requests per day, resulting in an unmanageably long request sequence $S$. In order to conduct sporadic association rule mining on the data, we first need to transform the request sequence into a transaction database.

The first difficulty is determining how to split $S$ into transactions. One approach is to split according to wall clock time, for example, grouping requests into transactions every five seconds. Another approach is to split using some fixed number of requests per transaction, e.g., grouping every 20 requests into a transaction. However, both approaches result in information loss, because nothing indicates that two requests placed in different transactions are not associated. Recall that only items in the same transaction can be discovered as frequent itemsets and hence as being potentially associated (a small illustration follows below). Soundararajan et al. [23] address this problem by splitting the sequence using a context given by the application, which is effective but requires changes to the underlying system to obtain such hints, sacrificing the generality for which Mithril is designed.
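As a small illustration of the information loss (stream and window size are made up), fixed-size splitting can place two adjacent requests into different transactions:

# Blocks 7 and 8 are adjacent twice in the stream, but the first
# co-occurrence straddles a transaction boundary, so pair support
# for (7, 8) is undercounted.
stream = [1, 2, 3, 7, 8, 4, 5, 6, 1, 7, 8, 2]
window = 4  # hypothetical transaction size
transactions = [stream[i:i + window] for i in range(0, len(stream), window)]
print(transactions)  # [[1, 2, 3, 7], [8, 4, 5, 6], [1, 7, 8, 2]]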

The second difficulty comes from the high time and space complexity of currently available sporadic association rule mining algorithms. Koh and Rountree [16] proposed mining sporadic association rules using Apriori-Inverse. Their algorithm, however, still requires two phases: mining all sporadically frequent itemsets and discovering sporadic association rules. Although the algorithm avoids generating and storing highly frequent itemsets, Apriori-Inverse still needs to store and count all possible associated pairs, at significant computation and storage overheads, as we confirmed using the SPMF library [5].

To efficiently discover associations between requests without requiring extra application-level hints, we propose the Mithril prefetching layer, whose algorithm provides a fast approximation to sporadic association rule mining.

4 Design of Mithril

Mithril is a prefetching layer between the existing caching layer and the backend, as shown in Figure 1. Without Mithril, an arriving request first touches the caching layer; on a cache hit, the item is returned directly from the cache, while on a cache miss, the application or caching layer must go to the backend to fetch the item. With Mithril added, when a request arrives, Mithril records the request for mining, checks the prefetching table for potential prefetches, and sends the corresponding request(s) to the caching layer for prefetching.

Figure 1: Schematic of the Mithril prefetching layer.

4.1 Mithril Mining

We now describe the algorithm at the core of our prefetching layer. Let $S$ be a sequence of block I/O requests, where the $i$-th request has logical time-stamp $i$, also known as its reference number. Let $T$ be an $N \times M$ time-stamp matrix over the $N$ unique blocks in $S$, where row $T_i$ corresponds to block $b_i$ and the cells of each row contain that block's time-stamps in increasing order. In addition, the rows of $T$ are sorted by the first time-stamp of each block. Figure 2 illustrates a request sequence and the corresponding time-stamp matrix $T$ (all symbols are listed in Table 2).

Symbol | Meaning
$T$    | Time-stamp matrix
$m$    | Minimum support
$M$    | Maximum support
$L$    | Lookahead range
$B$    | Maximum metadata size
$K$    | Prefetching list size

Table 2: Symbols used in the text.

An associated block pair refers to two blocks that are repeatedly accessed in sequence. In modern systems, because multiple applications interleave with each other, two consecutive accesses from the same stream may not appear consecutively in the final stream, so we define a lookahead range $L$ that specifies the maximum allowed distance between two associated blocks. In order to establish an association between two blocks, not only do they need to appear within $L$ of each other, but they also need to appear with some minimum frequency. We denote this threshold the minimum support $m$. Since our prefetching layer assumes the presence of a cache to catch frequent items, we specify a maximum support $M$ as the upper bound for items to be considered for mining within a certain time interval. We remark that each of these requirements has a conceptual counterpart in sporadic association rule mining.

To further distinguish associated block pairs, as illustrated in Figure 2, we define two blocks as being weakly associated if each time-stamp pair of the two blocks is within $L$; furthermore, if a weakly associated pair is accessed strictly consecutively (time-stamp difference of 1) at least once, we define it as being strongly associated. For example, with $L = 4$, blocks with time-stamp rows [3, 10] and [4, 12] are weakly associated since both differences are within 4, and strongly associated since $|3 - 4| = 1$.

The reason for distinguishing weakly associated pairs and strongly associated pairs is that two blocks in a strongly-associated pair are more likely to be related, which is preferred for prefetching. However, due to multiple applications interleaving, a strong association does not always exist for each block, while there might be multiple weakly associated pairs. Therefore, only a strongly associated pair and the closest weakly associated pair are considered.

Figure 2: Illustration of the mining procedure. If the input is a request sequence, it is first converted into the time-stamp matrix $T$. Blocks that have fewer than $m$ time-stamps (ts) or more than $M$ time-stamps are not considered for mining. For each pair of blocks, if they have different numbers of time-stamps, or if at least one time-stamp pair differs by more than $L$, they are not associated. If all time-stamp pairs are within $L$, they are weakly associated. Furthermore, if they have at least one time-stamp pair with difference 1, they are strongly associated.

We present the basic version of Mithril in Algorithms 1 and 2. The function checkAssociation (Algorithm 1) receives two rows from $T$ as input and checks whether the corresponding two blocks are weakly associated, strongly associated, or not associated.

Algorithm 2 shows the mining procedure, which uses $O(N \cdot L)$ association checks to discover associated block pairs, where $N$ is the number of unique blocks requested during the recording interval. The input of the algorithm can be the request sequence $S$ or the time-stamp matrix $T$. If the input is $S$, we first need to convert it into $T$, which takes time linear in the length of $S$.

In the outer loop, we iterate through all rows in $T$. For each block $b_i$, we check subsequent blocks in the inner loop to find blocks that are either strongly associated with $b_i$ or its first weakly associated partner. Because $T$ is sorted by the first time-stamp of each block, the inner loop stops once first time-stamps differ by more than $L$, so at most $L$ blocks are checked per row. Typically, the number of blocks checked is much smaller than $L$.

After an associated block pair is unveiled, it is stored in the prefetching table, which is checked for prefetching upon each request.

Algorithm 1: checkAssociation
Input: rows $T_i$ and $T_j$ from the time-stamp matrix $T$, lookahead range $L$
Result: whether blocks $b_i$ and $b_j$ are associated, and if so, whether weakly or strongly

if $T_i$ and $T_j$ contain different numbers of time-stamps then
    return notAssociated
strong ← false
for k ← 1 to the number of time-stamps in $T_i$ do
    if $|T_i[k] - T_j[k]| > L$ then
        return notAssociated
    if $|T_i[k] - T_j[k]| = 1$ then
        strong ← true
if strong then return stronglyAssociated, else return weaklyAssociated

Algorithm 2: Mithril mining procedure
Input: time-stamp matrix $T$, minimum support $m$, lookahead range $L$
Result: associated block pairs, stored in the prefetching table

for i ← 1 to len($T$) − 1 do
    if row $T_i$ holds fewer than $m$ time-stamps then
        continue
    for j ← i + 1 to len($T$) do
        if the first time-stamps of $T_i$ and $T_j$ differ by more than $L$ then
            break
        if checkAssociation($T_i$, $T_j$, $L$) ≠ notAssociated then
            store the pair ($b_i$, $b_j$) in the prefetching table
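The following Python sketch mirrors Algorithms 1 and 2 under simplifying assumptions: rows are plain (block, time-stamp list) pairs already sorted by first time-stamp, and all discovered pairs are returned rather than keeping only the strongest pair and the closest weak pair per block as Mithril does.

NOT_ASSOC, WEAK, STRONG = 0, 1, 2

def check_association(ts_a, ts_b, lookahead):
    """Pairwise comparison of two sorted time-stamp rows (cf. Algorithm 1)."""
    if len(ts_a) != len(ts_b):
        return NOT_ASSOC
    strong = False
    for ta, tb in zip(ts_a, ts_b):
        if abs(ta - tb) > lookahead:
            return NOT_ASSOC
        if abs(ta - tb) == 1:
            strong = True
    return STRONG if strong else WEAK

def mine(rows, min_sup, lookahead):
    """Scan rows sorted by first time-stamp (cf. Algorithm 2)."""
    pairs = []
    for i, (block_a, ts_a) in enumerate(rows):
        if len(ts_a) < min_sup:
            continue
        for block_b, ts_b in rows[i + 1:]:
            # Rows are sorted by first time-stamp, so once the gap exceeds
            # the lookahead range, no later row can be associated: break.
            if ts_b[0] - ts_a[0] > lookahead:
                break
            strength = check_association(ts_a, ts_b, lookahead)
            if strength != NOT_ASSOC:
                pairs.append((block_a, block_b, strength))
    return pairs

# Toy run: blocks 1 and 2 are accessed back to back twice.
rows = [(1, [1, 5]), (2, [2, 6]), (3, [3, 40])]
print(mine(rows, min_sup=2, lookahead=4))  # [(1, 2, 2)], i.e., strongly associated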

4.2 Optimizations

When Mithril runs, a two-dimensional time-stamp matrix is initialized. For each new request, if its block is found in the matrix, the current time-stamp is appended to the corresponding row; otherwise, the block is recorded in a new row. When a row is full, the block is considered frequent: it is deleted from the matrix and recorded in the frequent block hashmap, and blocks in this hashmap are ignored when encountered again before the mining process. When the time-stamp matrix is full, the mining procedure is called and the associated blocks are saved in the prefetching table. After mining completes, recording starts anew with a clean state.

The version of Mithril described so far requires a large matrix with maximum support $M$ columns for storing time-stamps, a hashmap mapping from block number to the corresponding row in the matrix, and a hashmap for determining whether a block is frequent. Additionally, a prefetching table is needed for storing associated block pairs for prefetching. However, spending limited cache space on tracking large metadata is not desirable. To address the metadata space usage of basic Mithril, we made the following optimizations, which use bounded memory in exchange for some added complexity.

Recording and Mining

Splitting the recording table. The two-dimensional recording table (time-stamp matrix) is a sparse matrix, since a typical block, by definition, is requested fewer than the maximum support $M$ times within a recording period. A naïve implementation would use a linked list for each block instead of a fixed-size array, but the link pointers between time-stamp nodes double the space overhead. We instead exploit the sparsity by decomposing the large matrix into two smaller fixed-size tables: one with minimum support $m$ columns, the recording table, and one with maximum support $M$ columns, which we call the mining table. The recording table is a circular array in which new entries replace old entries in FIFO fashion. The mining table is a fixed-size array that triggers the mining procedure when full.

When a block request arrives, the time-stamp is recorded in the recording table. If the number of time-stamps in the corresponding row reaches the minimum support $m$ – in other words, when the row is full – the block is declared mining ready and transferred into the mining table, which can store up to $M$ time-stamps for each block. After migrating a row from the recording table to the mining table, the last row in the recording table is moved up into the vacated slot to keep the table compact. When the mining table is full, the mining procedure is triggered to discover associated block pairs and store them in the prefetching table; when mining finishes, the mining table is cleared. When the recording table is full, we replace the oldest entry with a new entry, on the assumption that the oldest block remaining in the table is rare, since it has not been requested $m$ times within the interval.
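A minimal sketch of this bookkeeping follows, with dicts standing in for the fixed-size circular arrays (so the FIFO eviction of the oldest recording row is elided) and a hypothetical run_mining hook standing in for Algorithm 2:

MIN_SUP, MAX_SUP, MINING_ROWS = 4, 8, 1250  # illustrative sizes

recording = {}  # block -> up to MIN_SUP time-stamps (recording table)
mining = {}     # block -> up to MAX_SUP time-stamps (mining-ready rows)

def run_mining(table):
    pass  # would sort rows by first time-stamp, then apply Algorithm 2

def record(block, ts):
    if block in mining:
        row = mining[block]
        if len(row) < MAX_SUP:  # ignore time-stamps beyond M
            row.append(ts)
        return
    row = recording.setdefault(block, [])
    row.append(ts)
    if len(row) == MIN_SUP:               # row full: block is mining ready
        mining[block] = recording.pop(block)
        if len(mining) >= MINING_ROWS:    # mining table full: mine and clear
            run_mining(mining)
            mining.clear()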

Decomposing the original matrix not only saves space but also allows for more blocks to be tracked. Because the recording table does not need to be cleared each time, we retain extra information for blocks that are not mining ready. In the unoptimized approach, the large time-stamp matrix was cleared each time the mining finishes, discarding all information.

The primary drawback of splitting is that the mining table needs to be sorted before mining. This is because Algorithm 2 requires input to be sorted by the first time-stamp, which occurs automatically in our single-table construction. Since our separate mining table is created by inserting elements in the order of accumulating time-stamps, sorting the mining table before mining is necessary. In practice, however, the mining table is usually tiny and sorting is trivial.

Compressing time-stamps. To further reduce the space used by the recording table and the mining table, we compress time-stamps by storing only their lower 15 bits. This allows us to store four time-stamps in the lower 60 bits of one 64-bit integer, with a time-stamp counter stored in the upper 4 bits. One could compress further by also dropping the low-order bits of each time-stamp; we omitted this optimization in our experiments to limit time overhead.
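A sketch of this packing scheme (field layout as described above; the helper names are ours):

MASK15 = (1 << 15) - 1

def pack_append(word, ts):
    """Append one 15-bit time-stamp; the top 4 bits count used slots."""
    count = word >> 60
    assert count < 4, "word already holds four time-stamps"
    word &= (1 << 60) - 1                  # clear the counter bits
    word |= (ts & MASK15) << (15 * count)  # keep only the lower 15 bits
    return word | ((count + 1) << 60)      # write the new counter back

def unpack(word):
    count = word >> 60
    return [(word >> (15 * k)) & MASK15 for k in range(count)]

w = 0
for t in (3, 70_000, 123):
    w = pack_append(w, t)
print(unpack(w))  # [3, 4464, 123]; 70_000 wraps modulo 2**15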

Removing the frequent block hashmap. In the original Mithril approach, a block requested more than $M$ times in a recording interval is considered a frequent block, so no information about it should be recorded. To track such blocks, one could use a hashmap or a Bloom filter, but a hashmap requires extra memory and a Bloom filter incurs extra computation. Instead, we record a block only on a cache miss; in this way, all frequent blocks are automatically filtered out by the underlying cache. There are several other benefits. First, Mithril need not be invoked when cache sizes are sufficiently large and the minimum support $m$ is greater than 1; this happens gradually over larger cache sizes, since the mining phase runs less frequently. Second, if a block is accessed frequently over a short period, the optimized recording method cuts down overhead, since it records only cache misses and thus avoids spuriously recording frequently accessed blocks. If the cache size is small, recording bursts – and thus prefetching frequent items – is useful, since these blocks are constantly being evicted by the underlying cache.

Our optimizations trade off storage, computation overhead, prediction precision and hit-ratio improvement. The more useful information we record, the higher hit rate and precision can be achieved, but at the same time more overhead is incurred. Besides recording at cache miss as mentioned above, optionally we can also record the time-stamp when a block is evicted from the cache to obtain more information about the block. Recording at eviction is similar to recording at cache miss: in both approaches, the frequent blocks are filtered out by the underlying cache.

Prefetching

Splitting the prefetching table into shards. We use a two-dimensional array instead of lists to store associated block pairs, for the same storage-reduction reason as using an array for the recording table. In the prefetching table, the first column stores the originating block number, while the remaining columns store the blocks associated with it. The number of remaining columns is the maximum number of associations that can be stored, defined as the prefetching list size $K$. We use a default of three columns, meaning that at most two associated blocks ($K = 2$) can be stored for each block. For example, for an association $a \rightarrow b$, $a$ is stored in the first column and $b$ in the second. If another association $a \rightarrow c$ is found, the third column stores $c$. If more than $K$ associations are discovered, we replace the old associations in FIFO manner, which allows Mithril to adapt to changing workloads.

Since cache behavior varies across workloads, it is impossible to know ahead of time how many blocks will have associations, and thus how much memory will be needed. Therefore, we introduce the concept of shards. A shard is a prefetching table with 2000 rows that is dynamically allocated when needed. When a user specifies a maximum metadata size $B$ for Mithril, an upper bound is placed on the number of possible shards. When all possible shards are allocated, a new row replaces the oldest row.

By introducing shards, we aim to strike a balance between frequent allocation and overallocation of memory. In addition to saving metadata memory, this bounds the maximum memory usage by the maximum metadata size $B$.
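A sketch of the sharded table (row layout [origin, assoc_1, ..., assoc_K] with K = 2, FIFO replacement of both rows and associations; sizes and names are illustrative):

ROWS_PER_SHARD, K = 2000, 2

class PrefetchTable:
    def __init__(self, max_shards):
        self.max_shards = max_shards
        self.shards = []   # each shard is a list of ROWS_PER_SHARD rows
        self.index = {}    # origin block -> its row
        self.cursor = 0    # global FIFO cursor over all allocated rows

    def _alloc_row(self, origin):
        capacity = len(self.shards) * ROWS_PER_SHARD
        if self.cursor >= capacity and len(self.shards) < self.max_shards:
            self.shards.append([None] * ROWS_PER_SHARD)  # allocate on demand
            capacity += ROWS_PER_SHARD
        s, r = divmod(self.cursor % capacity, ROWS_PER_SHARD)
        old = self.shards[s][r]
        if old is not None:               # all shards in use: evict oldest row
            self.index.pop(old[0], None)
        row = [origin]
        self.shards[s][r] = row
        self.index[origin] = row
        self.cursor += 1
        return row

    def add(self, origin, assoc):
        row = self.index.get(origin) or self._alloc_row(origin)
        if assoc not in row[1:]:
            row[1:] = (row[1:] + [assoc])[-K:]  # FIFO: keep the newest K

    def lookup(self, block):
        row = self.index.get(block)
        return row[1:] if row else []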

Since prefetched blocks are added to the original cache pool, it is possible for a prefetched block to be evicted before it is used. As other authors suggest [8, 6], we give such a prefetched block a second chance by re-inserting it at the MRU end of the cache if it is about to be evicted without having been accessed.

4.3 Using Mithril

Using Mithril as a prefetching layer requires minor modifications to the underlying caching layer. The complete flow of Mithril is shown in Algorithm 3. A call into Mithril passes one parameter and two flags. The parameter is the current block number, which is used for recording, prefetching or both. The two flags, pFlag and rFlag, indicate whether the call is for prefetching, recording or both.

There are two scenarios in which the Mithril API may be called. First, when a request arrives, Mithril must check whether prefetching is needed; in this situation, pFlag is set and rFlag is unset. Second, Mithril is called to record an access, with rFlag set and pFlag unset. Recording may be invoked (a) only at cache misses, (b) only during cache eviction, (c) during both misses and evictions, or (d) at the arrival of each request. Recording at each request, or at both misses and evictions, increases the computation overhead. As we demonstrate in Section 5.4, recording on the arrival of each request (d) gives the best hit ratio, whereas recording only at cache misses (a) provides similar performance at much lower overhead. In contrast, we find that the two approaches involving recording on eviction (b, c) do not provide competitive performance.
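A sketch of how the caching layer might drive this entry point (the interface names are hypothetical, and mithril.handle is assumed to return a possibly empty list of blocks to prefetch; recording here follows option (a), record only on cache misses):

# Sketch of driving Mithril from the caching layer (names hypothetical).
def on_request(cache, mithril, block):
    # 1) Prefetch path: pFlag set, rFlag unset.
    for b in mithril.handle(block, pFlag=True, rFlag=False):
        cache.insert_prefetched(b)
    if cache.contains(block):
        return cache.get(block)
    # 2) Cache miss: record for later mining (rFlag set, pFlag unset).
    mithril.handle(block, pFlag=False, rFlag=True)
    return cache.fill_from_backend(block)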

Algorithm 3: The Mithril prefetching layer
Input: recording table $R$, mining table $Q$, prefetching table $P$, minimum support $m$, block number $b$, prefetching flag pFlag, recording flag rFlag
Output: blocks to prefetch

if rFlag then
    append the current time-stamp to the row for $b$ in $R$
    if the row for $b$ now holds $m$ time-stamps then
        move the row to $Q$; move the last row in $R$ up to fill the gap
    if $Q$ is full then
        mining($Q$); clear $Q$
if pFlag then
    if $b$ is in $P$ then
        return $P$[$b$]
return NULL (no need to prefetch)

4.4 Complexity Analysis

Time complexity. Compared to LRU, the only operations added to each request are recording the current logical time-stamp in the recording table on a cache miss, and checking the prefetching table and prefetching when needed. Each of these operations has $O(1)$ time complexity, so the added computation per request is negligible. Periodically, the mining procedure runs; it is dominated by sorting the mining table, an $O(n \log n)$ operation where $n$ is the fixed, typically small mining table size. The mining process can furthermore run in a background thread and thus avoid blocking new requests.

Space complexity. In the optimized Mithril, we store all time-stamps as 15-bit integers, with four time-stamps packed into one 64-bit integer. Thus, with maximum support $M = 8$, minimum support $m = 4$, a recording table of 100,000 rows and a mining table of 1,250 rows, recording and mining need less than 2 MiB. This count includes the hashtable that maps a block address to its index in the recording or mining table: 8 bytes store the block address and 4 bytes store the index.
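As a back-of-the-envelope check of this figure, assuming the packed layout above (one 64-bit word per recording row, two words per mining row, plus a 12-byte hashtable entry per tracked block):

$100{,}000 \times (8 + 12)\,\mathrm{B} + 1{,}250 \times (16 + 12)\,\mathrm{B} = 2{,}000{,}000\,\mathrm{B} + 35{,}000\,\mathrm{B} \approx 1.94\,\mathrm{MiB} < 2\,\mathrm{MiB}.$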

Since all information is stored in bounded arrays, the maximum metadata size $B$ is usually set to 10% of the cache size, which is more than enough in most cases. In our evaluation, we count all metadata against the cache size for a fair comparison: when Mithril metadata uses 5% of the cache space, only the remaining 95% is used to store cached data.

5 Evaluation

We now characterize Mithril experimentally with the following questions in mind:

  • How much does Mithril improve the hit ratio? What are the best and worst cases?

  • How well does Mithril work with various cache replacement algorithms, and how precise is prefetching?

  • How do parameters affect Mithril?

  • Is latency improvement enough to justify overhead?

  • Why does Mithril work?

5.1 Methodology

As a history-based prefetching layer, Mithril should ideally be compared with C-Miner [18] and QuickMine [23], the two state-of-the-art algorithms in history-based prefetching. However, C-Miner runs offline and QuickMine requires context information from the application, neither of which is applicable in our setting. Instead, we implemented another history-based prefetching technique, Probability Graph (PG) [10], together with a state-of-the-art sequential prefetching algorithm, Amp [6], and LRU to compare with Mithril. Note that Mithril can also be used on top of Amp.

We evaluated the algorithms on 106 traces from a commercial I/O caching analytics service from CloudPhysics (CP) [26], together with 29 traces from Microsoft Research (MSR) [22]; we omitted traces with fewer than half a million requests. For simulation-based results, we used mimircache [28] for profiling and analysis on a Microway server with dual E5-2670v3 CPUs and 512GB of memory. For the micro-benchmark, we modified IOBlazer [3] and ran it on an AWS EC2 c3.large instance with an EBS magnetic disk. In this section, unless specified otherwise, Mithril is used together with LRU, and all single-trace experiments use trace w94 from CP [26], a week-long VM trace. The cache size, unless mentioned otherwise, is set to 256MB, at which the LRU hit ratios across traces range from 10% to 99%. The profiling platform and Mithril implementation will be released as open source after publication [28]. The CP data used in the paper will be released by CloudPhysics separately.

Figure 3: Hit ratio of PG and Mithril for 106 CP traces and 29 MSR traces, sorted by PG hit ratio. The hit ratio of LRU is omitted as it is similar to PG (Pearson correlation coefficient of 0.993 with PG, compared to 0.801 for LRU and Mithril). Compared to PG, Mithril overall provides a significant improvement, even though parameters are not fine-tuned for each trace.

Figure 4: Left: hit ratio of Amp and Mithril-LRU; right: hit ratio of Amp and Mithril-AMP, for CP and MSR traces sorted by Amp hit ratio. Left: Mithril-LRU outperforms Amp on most traces; for some traces with strong sequentiality, Amp performs better due to its ability to prefetch pages that have never been requested. Right: Mithril-AMP improves or matches the hit ratio of Amp for most traces.

5.2 Overall Hit Ratio Improvement

Mithril is a prefetching layer agnostic to the underlying caching layer, which can be FIFO, LRU, Amp or any other cache replacement algorithm. In this section, we show that Mithril provides benefits for all of them.

Comparison with PG. PG is the most comparable history-based algorithm, so we compare Mithril with PG in this section. In Figure 3, we show the hit ratio of PG and Mithril for all the traces. LRU is not shown in the figure because of its high resemblance to PG in terms of average hit ratio and correlation: the Pearson correlation coefficient between the hit ratios of LRU and PG is 0.993, while it is 0.801 between LRU and Mithril. The low correlation between LRU and Mithril implies that the performance of Mithril does not completely depend on the performance of LRU. Compared to LRU, Mithril provides on average a 52% relative improvement in hit ratio on CP traces, and on average achieves 82% of the maximum obtainable hit ratio at a small cache size, where the maximum obtainable hit ratio is calculated by excluding cold misses. On the 29 MSR traces, Mithril provides on average a 64% hit ratio improvement, achieving 81% of the maximum obtainable hit ratio. As shown in the figure, the hit ratio improvement of Mithril varies between traces. For certain traces, it provides more than a 7× improvement; for others, the improvement is more modest, particularly those whose PG hit ratio is already high.

Comparison with Amp. As a prefetching layer, we also compare Mithril with the state-of-the-art sequential prefetching algorithm Amp, which dynamically captures spatial associations in the requests. Compared to Amp, Mithril provides on average a 31% increase in hit ratio on CP traces and 51% on MSR traces, indicating that by exploiting temporal associations, Mithril can provide more benefit than Amp. However, as shown in Figure 4, Mithril does not always provide more benefit than Amp. In some traces where sequentiality is not dominant, Mithril provides a great benefit, more than a 7× improvement in hit ratio; in other traces where sequentiality dominates the disk access pattern, Amp provides more benefit than Mithril. The reason Amp outperforms Mithril lies in its ability to prefetch blocks that have never been requested; Mithril, in contrast, can only prefetch blocks already seen in the past.

Although Amp surpasses Mithril in some cases, Mithril as a prefetching layer can be used on top of Amp. In Figure 4, we show the hit ratio obtained by Amp compared to Mithril-AMP. Using Mithril on top of Amp guarantees at least similar performance to Amp, and still provides a large benefit on most of the traces. This improvement implies that besides the spatial locality already captured by Amp, Mithril is capable of further leveraging temporal associations between requests for additional performance gains. Note that Figure 3 and Figure 4 cannot be directly compared, because the former is sorted by PG hit ratio and the latter by Amp hit ratio; the two panels of Figure 4, however, are comparable, since the curves in both are sorted by Amp hit ratio. Adding Mithril to Amp guarantees no performance loss compared to Amp; however, Mithril-AMP does not guarantee better performance than Mithril-LRU, as we see in some of the traces. The reason Mithril-LRU can be better than Mithril-AMP is that Amp turns some cache misses into cache hits through its sequential prefetching, so the request stream seen by Mithril is altered and the associations Mithril captures can be inaccurate. Overall, Mithril significantly improves hit ratio over PG and Amp.

Behavior on representative traces. To better illustrate the hit ratio improvement, we select six traces (three from CP and three from MSR) as typical examples of large (top two), modest (middle two) and small (bottom two) performance gains for Mithril in Figure 11. The top two traces show cases where Mithril at least doubles the hit ratio of the corresponding caching algorithm. The middle two figures show traces that already have relatively high hit ratios under LRU; adding Mithril provides a modest performance improvement. In the bottom two traces, Amp outperforms Mithril-LRU by being able to prefetch unseen blocks, although this can be changed by using Mithril with Amp. Still, in these cases, Mithril-AMP usually does not beat Amp by much in terms of hit ratio, because the hit ratios are often already high, limiting the potential benefit. Mithril can also only prefetch blocks that have already been seen, capping the maximum hit ratio at one minus the cold-miss ratio. PG is the only prefetching algorithm in the same category as Mithril. Its performance is unstable: sometimes better than Amp, but most of the time worse. For most traces, it outperforms pure LRU and is outdone by Mithril.

Mithril is compatible with a range of caching algorithms. The figures compare the performance of using Mithril on top of LRU, FIFO and Amp to that of the original cache replacement algorithms. Adding Mithril consistently boosts the hit ratio, particularly for simpler cache replacement algorithms. For example, when Mithril is added to FIFO, the performance of Mithril-FIFO is similar to Mithril-LRU, which is much better than FIFO alone. This property opens the possibility of pairing Mithril with particular cache replacement algorithms in appropriate situations; for instance, when run off of SSDs [24], Mithril with FIFO may achieve the best performance.


Figure 5: CP trace w94
Figure 6: MSR trace prxy
Figure 7: CP trace w91
Figure 8: MSR trace src1
Figure 9: CP trace w89
Figure 10: MSR trace proj
Figure 11: Hit ratio of different algorithms. Example traces where Mithril significantly improves hit rate (top two), where Mithril shows modest improvement (middle two), and where Mithril shows little or no performance gain (bottom two).

5.3 Cache Size and Precision

The Mithril prefetching layer can accommodate most cache replacement algorithms. To focus the discussion, we hereafter consider only LRU and Mithril-LRU.

Our results so far are based on performance at a single cache size. We now show the performance of Mithril under a range of cache sizes. Figure 12 shows the hit ratio curve (HRC) of LRU, PG and Mithril, along with the prefetching precision of the latter two. As shown in the HRC, PG always outperforms LRU, and as the cache size increases, the gap between PG and LRU widens due to the extra space allocated for PG's pair-wise probability matrix. However, the improvement of PG is limited by its large matrix. In contrast, Mithril provides a hit ratio boost even at a small cache size.

The precision curve of PG has several peaks and troughs because the size of its conditional probability matrix depends on the cache size. As the cache size increases, the matrix grows; however, precision may not benefit from the larger probability matrix, because the new predictions may be wrong. Similarly, the precision curve of Mithril is also not monotonic, especially at small cache sizes, due to the eviction of prefetched blocks before they are requested. Comparing the prefetching precision of PG and Mithril, we see that in most situations Mithril has better precision than PG.

Figure 12: Hit ratio curve and prefetching precision of LRU, PG and Mithril. Left: Mithril outperforms LRU and PG. Right: the prefetching precision of Mithril is higher than that of PG, and both curves are non-monotonic.

5.4 Effects of Parameters

Mithril uses several parameters, which we now investigate in isolation in terms of their impact on hit ratio and prefetching precision, using a representative CP trace (w94).

Maximum support $M$ determines the maximum allowed degree of hotness for a block and sets the row length of the mining table. If a block is requested more than $M$ times before mining, it is removed as a frequent block. As shown in Figure 19, $M$ has a small effect on hit ratio and prefetching precision, since most of the frequent blocks are already filtered out by the underlying caching layer; recall that Mithril records blocks only on cache misses.

Lookahead range $L$ determines the maximum allowed time-stamp difference for two blocks to be considered associated. Intuitively, $L$ should relate to the number of concurrently running processes. If $L$ is too large, non-associated block pairs will be mistaken as associated, increasing the false positive rate; if $L$ is too small, many associations will be missed, yielding a high false negative rate. As shown in Figure 19, when $L$ is small, increasing it raises the hit ratio substantially while slightly decreasing prefetching precision. Beyond a certain threshold, further increasing $L$ does not raise the hit ratio; the best $L$ relates to the number of concurrently running applications, and the trace shown in the figure has its best $L$ around 50.

Prefetching list size $K$ determines the space that can be used for storing associated blocks, i.e., the row length of the prefetching table. Recall that when more than $K$ associated blocks are discovered, the old blocks are replaced in FIFO manner. Figure 19 shows that increasing $K$ dramatically reduces prefetching precision, because a large $K$ means stale associations are also kept for prefetching. The hit ratio, on the other hand, first increases and then decreases with increasing $K$. We find that setting $K$ to 2 gives an acceptable trade-off between hit ratio and precision across the various datasets we considered.

Maximum metadata size $B$ determines the maximum space Mithril can use for the recording table, mining table and prefetching table. As illustrated in Figure 19, if $B$ is too small, there is not enough space for the prefetching table, dramatically reducing the effect of Mithril. Beyond a threshold, further increasing $B$ does not increase the hit ratio; moreover, setting $B$ too large in situations where Mithril does not perform well wastes space that could be used for caching. We thus recommend a default value of 10% of the entire cache space, based on the traces we tested.

Minimum support $m$ has the largest effect on the performance of Mithril. It determines when a block is ready for mining and sets the row length of the recording table. In Figure 19, we see that increasing $m$ increases prefetching precision while reducing the hit ratio. Two blocks must appear close together $m$ times to be considered associated; a larger $m$ makes the requirement for association stricter, which diminishes the number of associations but increases the confidence of the discovered associations.

The recording location also has a large effect on the performance of Mithril. As mentioned in Section 4, we record only at cache misses, which reduces computation by recording only the most important information. As shown in Figure 19, besides recording (a) at cache misses, we can also record (b) when a block is evicted from the cache, (c) at both cache misses and evictions, or (d) each time a request arrives. Options (c) and (d) usually give Mithril more information at the cost of more computation; in other words, we can trade CPU cycles for a potentially better hit ratio and precision. Across the traces, recording at evictions (b) usually does not provide good performance; recording at evictions and misses (c) occasionally matches the other two approaches (a) and (d), but most of the time is only slightly better than (b). Recording at the arrival of each request (d) usually gives the best performance with the highest precision. As an alternative, recording at cache misses (a) greatly reduces the overhead of Mithril while, on most traces we evaluated, costing less than 10% in performance compared to recording at each request.


Figure 13: maximum support
Figure 14: lookahead range
Figure 15: prefetching list size
Figure 16: maximum metadata size
Figure 17: minimum support
Figure 18: effect of recording
Figure 19: Effect of parameters in Mithril.

5.5 Real System Performance


Figure 20: Latency and CPU usage of using no cache, LRU, Amp and Mithril-LRU. Top: each latency point is the average latency over 40,000 requests. Bottom: the relative increase in CPU usage of Mithril due to mining and prefetching; compared to LRU and Amp, the increase is less than 1%.

Latency. A high hit ratio does not necessarily mean low latency in a real system, because of factors such as CPU overhead and late prefetches. Especially for history-based prefetching, the cost of prefetching a random block is large. In Figure 20, we weigh the overhead against the benefit. The figure shows the latency of four approaches on CP trace w94: using no cache, an LRU cache, Amp, and an LRU cache with a Mithril prefetching layer. Compared to no cache, LRU reduces average latency by more than 26%, especially at the peaks, where the no-cache system shows high latency. Using the sequential prefetcher Amp, latency further decreases by 32% over LRU on average, whereas Mithril with LRU reduces latency by 52% over LRU.

Late prefetches. Although the latency reduction due to Mithril prefetching is evident, we also see that 22.4% of prefetches are late, meaning the prefetched blocks arrive after the time they are requested. Late prefetches affect the performance of Mithril by wasting one disk read, unless the read is caught by the disk controller.

Mithril warm-up time. In Figure 20, focusing on the first 5% of the requests in a system with Mithril, we see no latency reduction at the beginning; latency then decreases as the trace progresses from 0% to 10%. The decrease occurs because Mithril needs sufficiently many requests to warm up before it conducts mining and prefetching.

Existence of latency peaks. Mithril does not eliminate all latency peaks. The peaks stem chiefly from two phenomena: long disk rotational latency, a burst of requests, or a mix of the two. When a peak is caused by long disk rotational latency, Mithril can effectively reduce latency by prefetching. One extreme case would be each block request requiring the disk to rotate halfway to retrieve the content, causing peaks in a system without Mithril; with Mithril, the associations between these requests would be unveiled and harnessed, so Mithril would prefetch each associated block into the cache ahead of its request time, lowering latency. On the other hand, if the latency peak is caused by a large number of outstanding I/Os [12], Mithril provides less benefit, because issuing prefetches only increases the burden on the disk. Consequently, not all latency peaks can be removed by Mithril.

CPU usage. Mithril is based on approximate association mining, which might be CPU-intensive. As shown in the figure, we see some increase in CPU consumption for Mithril; however, the increase is minor and within the limits afforded by many storage systems.

5.6 Mithril Analysis

In this section, we analyze the behavior underlying Mithril's performance. Figure 21 shows the associations discovered by Mithril after a full trace run. Both the horizontal and vertical axes are logical block addresses (LBA): if two blocks $a$ and $b$ are associated, a dot is placed at point $(a, b)$. The association plot clearly shows that Mithril not only discovers sequential block associations, denoted by the diagonal in the graph, but also many non-sequential block associations.

As mentioned earlier, Mithril is designed to catch mid-frequency blocks, since frequent blocks are captured by the underlying caching layer and rare blocks are by definition not worth chasing. Figure 22 and Figure 23 show the hit counts obtained by LRU and Mithril; the horizontal axis is sorted by the frequency of blocks in the original trace. LRU gets cache hits on most of the frequent blocks (left part of the figure). For mid- or low-frequency blocks, LRU shows a bushy image, because whether LRU can catch such a block depends on whether the block shows small-range locality. If a block shows small-range locality, it can be caught by LRU: for example, if a block is accessed only twice throughout the trace and the two accesses are separated by only a few requests, it will be captured by LRU, but if its two accesses are far apart, it will not. Mithril, besides capturing high-frequency blocks, also captures mid-frequency blocks, because it can predict their accesses ahead of time. As shown in the figure, Mithril has higher hit counts for most blocks in the mid-frequency range. These two figures illustrate the crux of why Mithril provides a high hit ratio: it discovers both sequential and non-sequential associations, capturing the mid-frequency blocks that tend to be ignored by common cache replacement policies.


Figure 21: Associations discovered by Mithril
Figure 22: Hit count in LRU
Figure 23: Hit count in Mithril
Figure 24: Mithril analysis. (a): associations discovered by Mithril contain both sequential and non-sequential associations; the four rectangular areas in the figure may represent two major applications interacting with each other. (b), (c): hit counts of blocks, sorted by block frequency in the original trace, illustrate that Mithril is able to capture mid-frequency blocks, while LRU cannot.

6 Conclusion

Storage systems increasingly rely on effective caching layers to sustain mounting performance demands. We proposed a novel, general history-based prefetching layer, Mithril, to supplement caching layers. Mithril relies purely on the logical timestamps of cache requests without any extra hints, making it easy to use and integrate into existing systems. We evaluated Mithril in terms of hit ratio on 106 week-long CP traces and 29 70-day-long MSR traces of real storage systems. Our experimental results suggest that Mithril is lightweight compared to other history-based approaches and provides a 36% greater hit ratio than the ubiquitous Amp sequential prefetching algorithm at modest cost.

Our work opens the door to combining effective cache replacement algorithms with Mithril to create low-overhead caching strategies that capture often-overlooked mid-frequency items and bolster cache performance in today's cloud storage systems.

References

  1. Amer, A., Long, D. D., and Burns, R. C. Group-based management of distributed file caches. In Distributed Computing Systems, 2002. Proceedings. 22nd International Conference on (2002), IEEE, pp. 525–534.
  2. Baker, M. G., Hartman, J. H., Kupfer, M. D., Shirriff, K. W., and Ousterhout, J. K. Measurements of a distributed file system. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 1991), SOSP ’91, ACM, pp. 198–212.
  3. Bergamasco, D. Ioblazer. https://labs.vmware.com/flings/ioblazer. Accessed: 2017-01-30.
  4. Chang, F., and Gibson, G. A. Automatic i/o hint generation through speculative execution. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 1999), OSDI ’99, USENIX Association, pp. 1–14.
  5. Fournier-Viger, P., Gomariz, A., Gueniche, T., Soltani, A., Wu, C., and Tseng, V. S. SPMF: a Java open-source pattern mining library. Journal of Machine Learning Research (JMLR) 15 (2014), 3389–3393.
  6. Gill, B. S., and Bathen, L. A. D. AMP: adaptive multi-stream prefetching in a shared cache. In Proceedings of the 5th USENIX conference on File and Storage Technologies (2007), USENIX Association, pp. 26–26.
  7. Gill, B. S., and Bathen, L. A. D. Optimal multistream sequential prefetching in a shared cache. Trans. Storage 3, 3 (Oct. 2007).
  8. Gill, B. S., and Modha, D. S. SARC: Sequential prefetching in adaptive replacement cache. In USENIX Annual Technical Conference, General Track (2005), pp. 293–308.
  9. Gniady, C., Butt, A. R., and Hu, Y. C. Program-counter-based pattern classification in buffer caching. In 6th Symp. Operating Systems Design & Implementation (OSDI) (Dec 2004), pp. 395–408.
  10. Griffioen, J., and Appleton, R. Reducing file system latency using a predictive approach. In USENIX summer (1994), pp. 197–207.
  11. Gu, P., Zhu, Y., Jiang, H., and Wang, J. Nexus: a novel weighted-graph-based prefetching algorithm for metadata servers in petabyte-scale storage systems. In Cluster Computing and the Grid, 2006. CCGRID 06. Sixth IEEE International Symposium on (2006), vol. 1, IEEE, 8 pp.
  12. Gulati, A., Shanmuganathan, G., Ahmad, I., Waldspurger, C., and Uysal, M. Pesto: Online storage performance management in virtualized datacenters. In Proceedings of the 2Nd ACM Symposium on Cloud Computing (New York, NY, USA, 2011), SOCC ’11, ACM, pp. 19:1–19:14.
  13. Han, J., Pei, J., and Kamber, M. Data mining: concepts and techniques. Elsevier, 2011.
  14. Jiang, S., Ding, X., Xu, Y., and Davis, K. A prefetching scheme exploiting both data layout and access history on disk. ACM Transactions on Storage (TOS) 9, 3 (2013), 10.
  15. Jiang, S., and Zhang, X. Lirs: An efficient low inter-reference recency set replacement policy to improve buffer cache performance. In Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 2002), SIGMETRICS ’02, ACM, pp. 31–42.
  16. Koh, Y. S., and Rountree, N. Finding sporadic rules using apriori-inverse. In Proceedings of the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (Berlin, Heidelberg, 2005), PAKDD’05, Springer-Verlag, pp. 97–106.
  17. Li, M., Varki, E., Bhatia, S., and Merchant, A. TaP: table-based prefetching for storage caches. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (2008), USENIX Association, p. 6.
  18. Li, Z., Chen, Z., Srinivasan, S. M., and Zhou, Y. C-Miner: Mining block correlations in storage systems. In FAST (2004), vol. 4, pp. 173–186.
  19. Li, Z., Chen, Z., and Zhou, Y. Mining block correlations to improve storage performance. ACM Transactions on Storage (TOS) 1, 2 (2005), 213–245.
  20. McKusick, M. K., Joy, W. N., Leffler, S. J., and Fabry, R. S. A fast file system for unix. ACM Trans. Comput. Syst. 2, 3 (Aug. 1984), 181–197.
  21. Megiddo, N., and Modha, D. S. Arc: A self-tuning, low overhead replacement cache. In Proceedings of the 2Nd USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2003), FAST ’03, USENIX Association, pp. 115–130.
  22. Narayanan, D., Donnelly, A., and Rowstron, A. Write off-loading: Practical power management for enterprise storage. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2008), FAST’08, USENIX Association, pp. 17:1–17:15.
  23. Soundararajan, G., Mihailescu, M., and Amza, C. Context-aware prefetching at the storage server. In USENIX Annual Technical Conference (2008), pp. 377–390.
  24. Tang, L., Huang, Q., Lloyd, W., Kumar, S., and Li, K. Ripq: Advanced photo caching on flash for facebook. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2015), FAST’15, USENIX Association, pp. 373–386.
  25. Teng, J. Z., and Gumaer, R. A. Managing ibm database 2 buffers to maximize performance. IBM Syst. J. 23, 2 (June 1984), 211–218.
  26. Waldspurger, C. A., Park, N., Garthwaite, A., and Ahmad, I. Efficient mrc construction with shards. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2015), FAST’15, USENIX Association, pp. 95–110.
  27. Wong, T. M., and Wilkes, J. My cache or yours? making storage more exclusive. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2002), ATEC ’02, USENIX Association, pp. 161–175.
  28. Yang, J. mimircache. https://github.com/1a1a11a/mimircache. Accessed: 2017-01-30.
  29. Yang, S., Srinivasan, K., Udayashankar, K., Krishnan, S., Feng, J., Zhang, Y., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Tombolo: Performance enhancements for cloud storage gateways. In IEEE 32nd Symposium on Mass Storage Systems and Technologies, MSST 2016 (2016).
  30. Zhou, Y., Philbin, J., and Li, K. The multi-queue replacement algorithm for second level buffer caches. In Proceedings of the General Track: 2001 USENIX Annual Technical Conference (Berkeley, CA, USA, 2001), USENIX Association, pp. 91–104.