Pangolin: A Fault-Tolerant Persistent Memory Programming Library
Non-volatile main memory (NVMM) allows programmers to build complex, persistent, pointer-based data structures that can offer substantial performance gains over conventional approaches to managing persistent state. This programming model removes the file system from the critical path, which improves performance, but it also places these data structures out of reach of file system-based fault tolerance mechanisms (e.g., block-based checksums or erasure coding). Without fault tolerance, using NVMM to hold critical data will be much less attractive.
This paper presents Pangolin, a fault-tolerant persistent object library designed for NVMM. Pangolin uses a combination of checksums, parity, and micro-buffering to protect an application’s objects from both media errors and corruption due to software bugs. It provides these protections for objects of any size and supports automatic detection of data corruption and recovery. The required storage overhead is small (1% for gigabyte pools of NVMM). Pangolin provides stronger protection, incurs orders of magnitude less storage overhead, and achieves comparable performance relative to the current state-of-the-art fault-tolerant persistent object library.
|University of California, San Diego and Steven Swanson|
|University of California, San Diego|
Emerging non-volatile memory (NVM) technologies (e.g., battery-backed NVDIMMs and 3D XPoint) provide persistence with performance comparable to DRAM. Non-volatile main memory (NVMM) is byte-addressable, cache-coherent NVM operating on the system’s main memory bus. The combination of NVMM and DRAM will enable hybrid memory systems that offer the promise of dramatic increases in storage performance.
A key feature of NVMM is support for direct access, or DAX, that lets applications perform loads and stores directly on the data of a file residing in NVMM. DAX offers the lowest-possible storage access latency and enables programmers to craft complex, customized data structures for specific applications. To support this use case, researchers and industry have proposed various persistent object systems [3, 42, 35, 22, 12, 24, 18, 5].
Building persistent data structures presents a host of challenges, particularly in the area of crash consistency and fault tolerance. Systems that use NVMM must preserve crash-consistency in the presence of volatile caches, out-of-order execution, software bugs, and system failures. To address these challenges, many groups have proposed crash-consistency solutions based on hardware [30, 36, 29, 32], file systems [4, 10, 43, 44], user-space data structures and libraries [45, 39, 3, 42, 35, 12, 5], and languages [33, 9].
Fault tolerance has received less attention but is equally important: To be viable as an enterprise-ready storage medium, persistent data structures must include protection from data corruption. Intel processors report uncorrectable memory media errors via a machine-check exception and the kernel forwards it to user-space as a SIGBUS signal. To our knowledge, Xu et al. were the first to design an NVMM file system that detects and attempts to recover from these errors. Among programming libraries, only libpmemobj provides any support for fault tolerance, but it incurs 100% space overhead, only protects against media errors (not software “scribbles”), and cannot recover corrupted data without taking the object store offline.
Xu et al. also highlighted a fundamental conflict between DAX-mmap() and file system-based fault tolerance: By design DAX-mmap() leaves the file system unaware of updates made to the file, making it impossible for the file system to update the redundancy data for the file. Their solution is to disable file data protection while the file is mapped and restore it afterward. This provides well-defined protection guarantees but leaves file data unprotected when it is in use.
Moving fault-tolerance to user-space NVMM libraries solves this problem, but presents challenges since it requires integrating fault tolerance into persistent object libraries that manage potentially millions of small, heterogeneous objects.
To satisfy the competing requirements placed on an NVMM-based, DAX-mapped object store, a fault-tolerant persistent object library should provide at least the following characteristics:
Crash-consistency The library should provide the means to ensure consistency in the face of both system failures and data corruption.
Protection against media and software errors Both types of errors are real threats to data stored to NVMM, so the library should provide a measure of protection against both.
Low storage overhead NVMM is expensive, so minimizing the storage overhead of fault tolerance is important.
Online recovery For good availability, recovery must proceed without taking the persistent object store offline.
High performance Speed is a key benefit of NVMM. If fault-tolerance incurs a large performance penalty, NVMM will be much less attractive.
Support for diverse objects A persistent object system must support objects ranging in size from a few cache lines to many megabytes.
This paper describes Pangolin, the first persistent object library to satisfy all these criteria. Pangolin uses a combination of parity, replication, and object-level checksums to provide space-efficient, high-performance fault tolerance for complex NVMM data structures. Pangolin also introduces a new technique for accessing NVMM, called micro-buffering, that simplifies transactions and prevents programming errors from corrupting NVMM.
We evaluate Pangolin using a suite of benchmarks and compare it to libpmemobj, a persistent object library that offers a simple replication mode for fault tolerance. Compared to libpmemobj, performance is similar, and Pangolin provides stronger protection, online recovery, and vastly reduced storage overhead (1% instead of 100%).
The rest of the paper is organized as follows: Section 2 provides a primer on NVMM programming and NVMM error handling in Linux. Section 3 describes Pangolin’s data organization, transactions, and how Pangolin detects and repairs NVMM errors. Section 4 presents our evaluations. Section 5 discusses related work. Finally, Section 6 concludes the paper.
Pangolin lets programmers build fault-tolerant, crash-consistent data structures in NVMM. This section first introduces NVMM and the DAX mechanism applications use to gain direct access to persistent data. Then, we describe the NVMM error handling mechanisms that Intel processors and Linux provide. Finally, we provide a brief primer on NVMM programming using libpmemobj, the library on which Pangolin is based.
2.1 Non-volatile Main Memory and DAX
Several technologies are poised to provide NVMM in computer systems. 3D XPoint is the closest to wide availability, but phase change memory (PCM), resistive RAM (ReRAM), and spin-torque transfer RAM (STT-RAM) are all under active development by memory manufacturers. Flash-backed DRAM is already available and in wide use. Linux and Windows both support accessing NVMM and using it as a storage medium.
The performance and cost parameters of NVMM lie between DRAM and SSD. Its write latency is expected to be longer than DRAM, but it will cost less per bit. From the storage perspective, NVMM is faster but more expensive than SSD.
The most efficient way to access NVMM is via direct access (DAX) memory mapping (i.e., DAX-mmap()). To use DAX-mmap(), applications map pages of a file in an NVMM-aware file system into their address space, so the application can access persistent data from user space using load and store instructions, without the file system intervening.
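As a minimal sketch of this load/store access pattern, the following Python snippet maps a file and updates it through the mapping. It uses an ordinary temporary file as a stand-in, which is an assumption made for illustration: on a real system the file would live in an NVMM-aware file system mounted with DAX, and the final `flush()` would correspond to cache-line write-back rather than page-cache synchronization.

```python
import mmap
import os
import tempfile

POOL_SIZE = 4096  # one page; a real pool would be far larger

# Stand-in for a file on a DAX-capable file system (assumption).
fd, path = tempfile.mkstemp()
os.ftruncate(fd, POOL_SIZE)

# Map the file into the address space; the accesses below do not go
# through read()/write() system calls.
pool = mmap.mmap(fd, POOL_SIZE)
pool[0:8] = (42).to_bytes(8, "little")        # a "store" to mapped memory
value = int.from_bytes(pool[0:8], "little")   # a "load" from mapped memory
pool.flush()                                  # push the update to the file
pool.close()
os.close(fd)

# Re-read the file through the normal file API to confirm the update landed.
with open(path, "rb") as f:
    on_media = int.from_bytes(f.read(8), "little")
os.remove(path)
```

The point of the sketch is only that, once the mapping exists, persistent data is read and written with ordinary memory operations.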
2.2 Handling NVMM Media Errors
To recover from data corruption, Pangolin relies on error detection and media management facilities that the processor and operating system together provide. Below, we describe these facilities available on Intel and Linux platforms. Windows provides similar mechanisms.
Hardware Error Correction We expect NVMM devices and their memory controllers will implement error-correction code (ECC) in hardware to detect and correct media errors as ECC-DRAMs do: The controllers detect and correct errors when they can, and they report uncorrectable (but detectable) errors with a machine check exception (MCE) that the operating system can catch and attempt to handle.
Pangolin does not require ECC for correct operation. As we describe in Section 3, Pangolin can use checksums to detect errors without hardware ECC and can also detect corruption due to software bugs (which are invisible to hardware ECC). ECC does, however, improve performance by transparently handling many media errors.
Regardless of the ECC algorithm hardware provides, field studies of DRAM and SSDs [38, 13, 40, 37, 25, 31] have shown that detectable but uncorrectable media errors occur frequently enough to warrant additional software protection. For instance, file systems [46, 44, 20] apply checksums to their data structures to protect against scribbles.
Repairing Errors When the hardware detects an uncorrectable error, the Linux kernel marks the region surrounding the failed load as “poisoned,” and future loads from the region will fail. Pangolin assumes an error poisons a 4 KB page since Linux currently manages memory failures at page granularity.
If a running application causes an MCE (by issuing a load that fails), the kernel sends it a SIGBUS and the application can extract the affected address from the data structure describing the signal.
The software can repair the poisoned page by writing new data to the region. In response, the operating system and NVDIMM firmware together remap the poisoned addresses to functioning memory cells. The details of this process are part of the Advanced Configuration and Power Interface (ACPI) for NVDIMMs.
2.3 NVMM Programming
In this section, we describe libpmemobj’s programming model. Libpmemobj is a well-supported, open-source C library for programming with DAX-mapped NVMM. It provides facilities for memory management and software transactions that let applications build a persistent object store. Pangolin’s interface and implementation are based on libpmemobj from PMDK v1.5, the latest release at the time of paper submission.
Linux exposes NVMM to user space as memory-mapped files (Figure 1). Libpmemobj (and Pangolin) refer to the mapped file as a pool of persistent objects. Each pool spans a contiguous range of virtual addresses.
Within a pool, libpmemobj reserves a metadata region that contains information such as the pool’s identification (64-bit UUID) and the offset to a “root object” from which all other live objects are reachable. Next is an area reserved for transaction logs. Libpmemobj uses redo logging for its metadata updates and undo logging for the application’s object updates. Transaction logs reside in one of two locations depending on their sizes. Small log entries live in the provisioned “Log” region, as shown in Figure 1. Large ones overflow into the “Heap” storage area.
The rest of the pool is the persistent heap. Libpmemobj’s NVMM allocator (a persistent variant of malloc/free) manages it. The allocator divides the heap’s space into several “zones” as shown in Figure 1. A zone contains metadata and a sequence of “chunks.” The allocator algorithm divides up a chunk for small objects and coalesces adjacent chunks for large objects. By default, a zone is 16 GB, and a chunk is 256 KB.
Listing 1 presents an example to highlight the key concepts of NVMM programming. The code performs two independent operations on a persistent linked list: one is to modify a node’s value, and another is to allocate and link a new node.
This example demonstrates two styles of crash-consistent NVMM programming: atomic-style (lines 3-5) for a simple modification that is no more than 8 bytes, and transactional-style (lines 7-13) for an arbitrarily-sized NVMM update.
Building data structures in NVMM using libpmemobj (or any other persistent object library) differs from conventional DRAM programming in several ways:
Memory Allocation Libpmemobj provides crash-consistent NVMM allocation and deallocation functions: pmemobj_tx_alloc/free. They let the programmer specify the type and size of the object to allocate, and they prevent orphaned regions in the case of poorly timed crashes.
Addressing Scheme Persistent pointers within a pool must remain valid regardless of the virtual address at which the pool is mapped. Libpmemobj uses a PMEMoid data structure to address an object within a pool. It consists of a 64-bit file ID and a 64-bit byte offset relative to the start of the file. The pmemobj_direct() function translates a PMEMoid into a native pointer for use in load or store instructions.
Failure-atomic Updates Modern x86 CPUs only guarantee that an 8-byte, aligned store atomically updates NVMM. If applications need larger atomic updates, they must manually construct software transactions. Libpmemobj provides undo log-based transactions. The application executes stores to NVMM between the TX_BEGIN and TX_END macros, and it snapshots a range of object data (with pmemobj_tx_add_range) before modifying it in place.
Persistence Ordering Intel CPUs provide cache flush/write-back (e.g., CLFLUSH(OPT) and CLWB) and memory ordering (e.g., SFENCE) instructions to make guarantees about when stores become persistent. In Listing 1, the pmemobj_persist function and TX macros integrate such instructions to flush modified object ranges.
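The snapshot-then-modify discipline of undo logging can be illustrated with a toy model. This is a sketch under stated assumptions, not libpmemobj's implementation: `ToyPool` and its methods are hypothetical names, a `bytearray` stands in for the mapped pool, and persistence ordering (flushes and fences) is not modeled.

```python
class ToyPool:
    """A bytearray standing in for a mapped NVMM pool (illustrative only)."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.undo_log = []  # list of (offset, old_bytes) snapshots

    def add_range(self, offset, length):
        # Snapshot the old contents before they are modified in place.
        self.undo_log.append((offset, bytes(self.mem[offset:offset + length])))

    def commit(self):
        # The in-place updates are final; the snapshots are no longer needed.
        self.undo_log.clear()

    def abort(self):
        # Roll back by restoring snapshots in reverse order.
        for offset, old in reversed(self.undo_log):
            self.mem[offset:offset + len(old)] = old
        self.undo_log.clear()


pool = ToyPool(64)
pool.mem[0:4] = b"old!"

pool.add_range(0, 4)       # snapshot first ...
pool.mem[0:4] = b"new!"    # ... then modify in place
pool.abort()               # simulate a failure before commit
after_abort = bytes(pool.mem[0:4])

pool.add_range(0, 4)
pool.mem[0:4] = b"new!"
pool.commit()
after_commit = bytes(pool.mem[0:4])
```

Forgetting the `add_range` call before the in-place write would leave nothing to roll back to, which is the class of programming error the snapshot requirement exposes applications to.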
Libpmemobj supports a replicated mode that requires a replica pool, doubling the storage the object store requires. Libpmemobj applies updates to both pools to keep them synchronized.
Replicated libpmemobj can detect and recover from media errors only when the object store is offline, and it cannot detect or recover from data corruption caused by errant stores to NVMM, so-called “scribbles,” such as buffer overruns and wild pointer dereferences.
3 Pangolin Design
Pangolin allows programmers to build complex, crash-consistent persistent data structures that are also robust in the face of media errors and software “scribbles” that corrupt data. Pangolin satisfies all of the criteria listed in Section 1. This section describes its architecture and highlights the key challenges that Pangolin addresses to meet those requirements. In particular, Pangolin provides the following features not found in prior work.
It provides fast, space-efficient recovery from media errors and scribbles.
It uses checksums to protect object integrity and supports incremental checksum updates.
It integrates parity and checksum updates into an NVMM transaction system.
It periodically scrubs data to identify corruption.
It detects and recovers from media errors and scribbles online.
Pangolin guarantees that it can recover from the loss of any single 4 KB page of data in a pool. In many cases, it can recover from the concurrent loss of multiple pages.
We begin by describing how Pangolin organizes data to protect user objects, library metadata, and transaction logs using a combination of parity, replication, and checksums. Next, we describe micro-buffers and explain how they allow Pangolin to preserve a simple programming interface and prevent software from scribbling on NVMM. Then, we detail how Pangolin detects and prevents NVMM corruption and elaborate on Pangolin’s transaction implementation with support for efficient, concurrent updates of object parity. Finally, we discuss how Pangolin restores data integrity after corruption and crashes.
3.1 Pangolin’s Data Organization
Pangolin uses replication for its internal metadata and RAID-style parity for user objects to provide redundancy for corruption recovery. Object checksums in Pangolin allow for corruption detection.
Pangolin views a zone’s chunks as a two-dimensional array, arranged in rows and columns as shown in the middle part of Figure 2. Each “chunk row” contains multiple, contiguous chunks and the chunks “wrap around” so that the last chunk of a row and the first chunk of the next are adjacent. Pangolin reserves the last chunk row for parity which lets Pangolin recover data once it identifies corruption.
In our description of Pangolin, we define a page column as a one page-wide, aligned column that cuts across the rows of a zone. A range column is defined the same way but can be arbitrarily wide, up to the size of a chunk row.
To detect corruption in user objects, Pangolin adds a 32-bit checksum to the object’s header. The header also contains the object’s size (64-bit) and type (32-bit). The compiler determines type values according to user-defined object types.
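A minimal sketch of such a header and its verification follows. The field order, the little-endian encoding, and the choice to checksum the type, size, and payload together are assumptions made for illustration; the checksum function is Adler32 from Python's `zlib`, chosen because Section 3.5 names Adler32. The helper names `seal` and `verify` are hypothetical.

```python
import struct
import zlib

# Assumed layout: 32-bit checksum, 32-bit type, 64-bit size (little-endian).
HEADER = struct.Struct("<IIQ")


def seal(type_id, payload):
    """Build header + payload with a checksum over type, size, and payload."""
    body = struct.pack("<IQ", type_id, len(payload)) + payload
    return HEADER.pack(zlib.adler32(body), type_id, len(payload)) + payload


def verify(obj):
    """Return True if the stored checksum matches the object's contents."""
    checksum, type_id, size = HEADER.unpack_from(obj)
    payload = obj[HEADER.size:HEADER.size + size]
    body = struct.pack("<IQ", type_id, size) + payload
    return zlib.adler32(body) == checksum


obj = seal(7, b"linked-list node")
scribbled = bytearray(obj)
scribbled[-1] ^= 0xFF  # simulate a stray write into the payload
```

Here `verify(obj)` succeeds while `verify(scribbled)` fails, which is the signal that would trigger recovery.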
Pangolin’s object placement is independent of chunk and row boundaries. The allocator can place objects anywhere within a zone and objects can be of any size (up to the zone size).
In addition to user objects, the library maintains metadata for the pool, zones, and chunks, including allocation bitmaps. Pangolin checksums these data structures to detect corruption and replicates the pool’s and zones’ metadata for fault tolerance. These structures are small (less than 0.1% for pools larger than 1 GB), so replicating them is not expensive. Pangolin does not replicate chunk metadata. Instead, zone parity protects it.
Pangolin checksums transaction logs and replicates them for redundancy. It treats log entries in the zone storage as zeros during parity calculations. This prevents parity update contention between log entries and user objects (see Section 3.5).
Fault Tolerance Guarantees Pangolin can tolerate a single 4 KB media error anywhere in the pool, regardless of whether it is a data page or a parity page. Based on the bad page’s address, Pangolin can locate its page column and restore its data using the other, healthy pages.
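The recovery of a lost page from its page column can be sketched with plain XOR. The sizes below are deliberately tiny, and the row contents, names, and helper functions are illustrative assumptions rather than Pangolin's code.

```python
PAGE = 16             # toy page size; Pangolin assumes 4 KB pages
ROW_SIZE = 4 * PAGE   # toy chunk-row size
DATA_ROWS = 3


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


# Data rows with arbitrary contents, plus one parity row that is the XOR
# of all data rows.
rows = [bytes((r * 31 + i) % 256 for i in range(ROW_SIZE))
        for r in range(DATA_ROWS)]
parity = bytes(ROW_SIZE)
for row in rows:
    parity = xor_bytes(parity, row)


def recover_page(rows, parity, bad_row, page_index):
    """Rebuild one page of `bad_row` from the other rows in its page column."""
    lo, hi = page_index * PAGE, (page_index + 1) * PAGE
    page = parity[lo:hi]
    for r, row in enumerate(rows):
        if r != bad_row:
            page = xor_bytes(page, row[lo:hi])
    return page


# Simulate losing page 2 of row 1, then rebuild it from the survivors.
original = rows[1][2 * PAGE:3 * PAGE]
rebuilt = recover_page(rows, parity, bad_row=1, page_index=2)
```

A lost parity page is handled symmetrically: XORing the same page range of every data row regenerates it.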
Faults affecting two pages of the same page column may cause data loss if the corrupted ranges overlap. If an application demands more robust fault tolerance, it can increase the chunk row size, reducing the number of rows and, consequently, the likelihood that two corrupt pages overlap.
Pangolin can recover from scribbles (contiguous overwrites caused by software errors) on NVMM data up to a chunk-row size. By default, Pangolin uses 100 chunk rows, and parity consumes 1% of a pool’s size (e.g., 80 MB for an 8 GB pool in our evaluations).
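The 1% figure follows directly from reserving one chunk row out of 100 for parity. A back-of-the-envelope check, assuming binary units and ignoring metadata and zone boundaries (both simplifications):

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def parity_bytes(pool_bytes, chunk_rows):
    # One of the chunk rows holds parity, so parity takes 1/chunk_rows
    # of the space.
    return pool_bytes // chunk_rows


pool = 8 * GIB
parity = parity_bytes(pool, 100)
parity_mib = parity / MIB            # about 82 MiB, i.e. roughly 80 MB
overhead_pct = 100 * parity / pool   # about 1 percent
```

Using fewer, wider chunk rows would raise this fraction (for example, 50 rows gives 2%), which is the cost of the stronger protection described above.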
3.2 Micro-buffering for NVMM Objects
Table 1: Pangolin’s core functions.
|pgl_tx_begin()/commit()/end(), etc.||Control the lifetime of a Pangolin transaction.|
|pgl_tx_alloc()/free()||Allocate or deallocate an NVMM object.|
|pgl_tx_open(PMEMoid oid, ...)||Create a thread-local micro-buffer for an NVMM object. Verify (and restore) the object integrity, and return a pointer to the micro-buffered user object.|
|pgl_tx_add_range(PMEMoid oid, ...)||Invoke pgl_tx_open and then mark a range of the object that will be modified.|
|pgl_get(PMEMoid oid)||Get access to an object, either directly in NVMM or in its micro-buffer, depending on the transaction context. It does not verify the checksum.|
|pgl_open(PMEMoid oid, ...)||Create a micro-buffer for an NVMM object without a transaction. Check the object integrity, and return a pointer to the micro-buffered user object.|
|pgl_commit(void *uobj)||Automatically start a transaction and commit the modified user object in its micro-buffer to NVMM.|
Pangolin introduces micro-buffering to hide the complexity of updating checksums and parity when modifying NVMM objects. Because objects carry checksums and parity, any modification to an NVMM object also requires updating both. This raises a challenge for the atomic programming model shown in Listing 1 (lines 3-5) because a single 8-byte NVMM write cannot carry all these updates.
Micro-buffering creates a shadow copy of an NVMM object in DRAM that separates an object’s transient state from its persistent version (Figure 2). In Listing 2, pgl_open initiates a micro-buffer for the node object by allocating a DRAM buffer and copying the node’s data from NVMM. It also verifies the object’s checksum and performs corruption recovery if necessary.
The application can modify the micro-buffered object without concern for its checksum, parity, and crash-consistency because changes exist only in the micro-buffer. When the updates finish, pgl_commit automatically starts a transaction that updates the NVMM object, its checksum, and parity in a crash-consistent way (described below). Compared to lines 3-5 of Listing 1, Pangolin retains the simple, atomic-style programming model for modifying a single NVMM object, and it supports updates within an object beyond 8 bytes.
Each micro-buffer’s header contains information such as its NVMM address, modified ranges, and status flags (e.g., allocated or modified). When a transaction involves multiple objects, Pangolin records them in a linked list and uses a hash table (indexed by PMEMoid) to track which objects have micro-buffered versions. After a transaction commits, Pangolin recycles all micro-buffers. We elaborate on Pangolin’s programming interface in Section 3.4.
Moreover, micro-buffering can prevent some programming bugs from corrupting NVMM. If an application’s code can directly write to NVMM, as libpmemobj allows, a buffer overrun bug can easily cause unrecoverable NVMM corruption. Using micro-buffers isolates transient writes from persistent data, and Pangolin inserts a “canary” in each micro-buffer’s header and checks its integrity before writing back to NVMM. On transaction commit, if Pangolin detects a canary mismatch, it aborts the transaction to avoid propagating the corruption to NVMM. Corruption that bypasses the canary protection is still detected using checksums.
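The canary check can be sketched with a toy DRAM arena that holds micro-buffers back to back, each preceded by a canary in its header. Everything here is an illustrative assumption: the canary value, the arena layout, and the helper names `open_object` and `commit` are hypothetical, and checksums, parity, and logging are left out.

```python
CANARY = b"\xde\xad\xbe\xef"  # illustrative value

nvmm = bytearray(b"AAAAAAAABBBBBBBB")  # two 8-byte persistent objects
dram = bytearray()                     # arena of back-to-back micro-buffers


def open_object(offset, size):
    """Copy an object into DRAM behind a canary header; return the data start."""
    dram.extend(CANARY + nvmm[offset:offset + size])
    return len(dram) - size


def commit(buf_start, offset, size):
    """Write the micro-buffer back only if its canary header is intact."""
    if dram[buf_start - len(CANARY):buf_start] != CANARY:
        return False  # abort: do not propagate corruption to NVMM
    nvmm[offset:offset + size] = dram[buf_start:buf_start + size]
    return True


a = open_object(0, 8)
b = open_object(8, 8)

dram[a:a + 10] = b"X" * 10  # buggy write: 2 bytes past the end of buffer a
ok_a = commit(a, 0, 8)      # a's own header is intact, so this commits
ok_b = commit(b, 8, 8)      # b's canary was clobbered, so this aborts
```

The overrun never reaches NVMM: the second object keeps its old contents because its micro-buffer fails the canary check.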
3.3 Detecting NVMM Corruption
Pangolin uses three mechanisms to detect NVMM corruption. First, it installs a handler for SIGBUS (see Section 2.2) that fires when the Linux kernel receives an MCE. A signal handler has access to the address the offending load accessed, and Pangolin can determine what kind of data (i.e., metadata or a user object) lives there and recover appropriately. This mechanism detects media failures, but it cannot discover corrupted data caused by software “scribbles.”
To detect scribbles, Pangolin verifies the integrity of user objects using their checksums. Verifying checksums on every access is expensive. To limit this cost, Pangolin verifies checksums at two times. First, it verifies an object’s checksum during micro-buffer creation before the object is modified as part of a transaction. This keeps Pangolin from recalculating the new checksum based on corrupt data. Second, it implements a periodic scrubbing task that verifies and restores the whole pool’s data integrity. We evaluate the impact of checksum verification in Section 4.
Finally, Linux keeps track of known bad pages of NVMM across reboots. When opening a pool or during its scrubbing, Pangolin extracts this information and attempts to recover the data in the reported pages.
3.4 Fault-Tolerant Transactions
Failure-atomic transactions are central to Pangolin’s interface, and they must include verification of data integrity and updates to the checksums and parity data that protect objects. Table 1 summarizes Pangolin’s core functions.
The program in Listing 1 can be easily ported to Pangolin using equivalent functions. One subtle difference is in the handling of atomic-style NVMM updates, as shown in Listing 2 (see Section 3.2).
Each transaction manages its own micro-buffers using a thread-local hashmap, indexed by an NVMM object’s PMEMoid. Therefore, in a transaction, calling pgl_tx_open for the same object either creates or finds its micro-buffer. Micro-buffers for one transaction are not visible in other transactions, providing isolation.
If a transaction will modify an object, Pangolin copies it to a micro-buffer, performs the changes there, and then propagates the changes to NVMM during commit. Since changes occur in DRAM (which does not require undo information), Pangolin implements redo logging.
At transaction commit, Pangolin recomputes the checksums for modified micro-buffers, creates and replicates redo log entries for the modified parts of the micro-buffers, and writes these ranges back to the NVMM objects. Then, it updates the affected parity bits (see Section 3.5) and marks the transaction committed. Finally, Pangolin garbage-collects its logs and micro-buffers.
If a transaction aborts, either due to unrecoverable data corruption or other run-time errors, Pangolin discards the transaction’s micro-buffers without touching NVMM.
A transaction can also allocate and deallocate objects. Pangolin also uses redo logging to record NVMM allocation and free operations, just as libpmemobj does.
For read-only workloads, repeatedly creating micro-buffers and verifying object checksums can be very expensive. Therefore, Pangolin provides pgl_get to gain direct access to an NVMM object without verifying the object’s checksum. The application can verify an object’s integrity manually as needed or rely on Pangolin’s periodic scrubbing mechanism. Inside a transaction context, pgl_get returns a pointer to the object’s micro-buffer to preserve isolation.
3.5 Parity and Checksum Updates
Objects in different rows can share the same range of parity, and we say these objects overlap. Object overlap makes updating the shared parity challenging: updates from different transactions must serialize, but naively locking the whole parity region sacrifices scalability.
For instance, suppose that in Figure 2 two threads modify two overlapping objects, $A$ and $B$, replacing $A$ with $A'$ and $B$ with $B'$, respectively. After both transactions update the shared parity $P$, it should have the value $P' = P \oplus A \oplus A' \oplus B \oplus B'$ regardless of how the two transaction commits interleave.
Pangolin uses a combination of two techniques that exploit the commutativity of XOR and fine-grained locking to preserve correctness and scalability.
Atomic parity updates The first approach uses the atomic XOR instruction (analogous to an atomic increment) that modern CPUs provide to perform incremental parity updates for changes to each overlapping object.
In our example, we can compute two parity patches, $\Delta A = A \oplus A'$ and $\Delta B = B \oplus B'$, and then rewrite $P'$ as $P' = P \oplus \Delta A \oplus \Delta B$. Since XOR commutes and is a bit-wise operation, the two threads can perform their updates without synchronization beyond what atomic XOR provides.
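The commutativity argument can be checked with a small sketch. The byte strings and names below are illustrative assumptions: `A` and `B` are the two overlapping objects, `C` is a third, untouched object sharing the same parity range, and `delta_A` and `delta_B` are the parity patches. The sketch checks only the algebra and does not model atomic instructions.

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


A, B, C = b"\x11" * 8, b"\x22" * 8, b"\x33" * 8
P = xor_bytes(xor_bytes(A, B), C)   # parity before either update

A_new, B_new = b"\xa5" * 8, b"\x5a" * 8
delta_A = xor_bytes(A, A_new)       # patch from the first transaction
delta_B = xor_bytes(B, B_new)       # patch from the second transaction

# Apply the two patches in both possible orders.
order_1 = xor_bytes(xor_bytes(P, delta_A), delta_B)
order_2 = xor_bytes(xor_bytes(P, delta_B), delta_A)

# Parity recomputed from scratch over the updated objects.
expected = xor_bytes(xor_bytes(A_new, B_new), C)
```

Both orders produce the same parity, and it equals the parity recomputed from the new object contents.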
Hybrid parity updates Atomic XOR is slower than normal XOR, so for large parity updates, atomic XOR can be inefficient. Therefore, Pangolin’s hybrid parity scheme switches to normal XOR for large transfers. To allow large and small parity updates in parallel, Pangolin uses reader/writer locks, one per page column. The “readers” perform updates using atomic XOR, and the “writer” uses normal XOR while holding the page-column lock exclusively.
Applications can set the threshold between “small” and “large” updates. For single-threaded applications and applications with little contention, a lower threshold (perhaps 0) will yield the best performance. If contention is more intense, a larger threshold gives better performance.
Pangolin refreshes an object’s checksum in its micro-buffer before updating parity, and it considers the checksum field as one of the modified ranges of the object. Checksums like CRC32 require recomputing the checksum over the whole object, which can become costly for large objects. Thus, Pangolin uses Adler32, a checksum that allows incremental updates, to make the cost of updating an object’s checksum proportional to the size of the modified range rather than the object size.
We implement Pangolin’s parity and checksum updates using the Intelligent Storage Acceleration Library (ISA-L), which leverages SIMD instructions of modern CPUs for these data-intensive tasks.
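To make the incremental property concrete, here is a from-scratch sketch of patching an Adler32 value after a range of an object changes. It is not ISA-L's interface; the function name `adler32_patch` and its parameters are hypothetical, and `zlib.adler32` is used only as a reference to check the result.

```python
import zlib

MOD = 65521  # largest prime below 2**16, as used by Adler-32


def adler32_patch(checksum, total_len, offset, old, new):
    """Update an Adler-32 value after bytes [offset, offset+len(old)) change.

    Only the modified range is read, so the cost is proportional to its size.
    """
    assert len(old) == len(new)
    a = checksum & 0xFFFF
    b = (checksum >> 16) & 0xFFFF
    delta_a = 0
    delta_b = 0
    for k, (o, n) in enumerate(zip(old, new)):
        diff = n - o
        delta_a += diff
        # A byte at position i contributes (total_len - i) times to the B sum.
        delta_b += (total_len - (offset + k)) * diff
    return (((b + delta_b) % MOD) << 16) | ((a + delta_a) % MOD)


obj = bytearray(range(256)) * 16   # a 4 KB "object"
before = zlib.adler32(obj)

offset, new = 1000, b"updated field"
old = bytes(obj[offset:offset + len(new)])
obj[offset:offset + len(new)] = new

patched = adler32_patch(before, len(obj), offset, old, new)
```

The patched value matches a full recomputation over the modified object, even though the patch touched only 13 of the 4096 bytes.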
Applications to other transaction systems Other NVMM persistent object systems could apply Pangolin’s techniques for parity and checksum updates. For example, consider an undo logging (as opposed to Pangolin’s redo logging) system that first stores a “backup” copy of an object in the log before modifying the original in-place. In this case, to add parity calculations, the system could compute a parity patch using the XOR result between the logged data (old) and the object’s data (new). Then, it can apply the parity patch using the hybrid method we described in this section.
3.6 Recovering from Faults
In this section, we discuss how Pangolin recovers data integrity from both NVMM corruption and unexpected system crashes.
Corruption recovery Pangolin uses the same algorithm to recover from errors regardless of how it detects them (i.e., via SIGBUS or a checksum mismatch).
The first step is to abort the current thread’s transaction, prevent the initiation of new transactions by setting the pool’s “freeze” flag, and wait until all outstanding transactions have completed. This is necessary because parity data may be inconsistent while a transaction commit is updating it. The scrubbing thread also freezes the pool.
Once the pool is frozen, Pangolin uses the parity bits and the corresponding parts of each row in the page column to recover the missing data.
Pangolin preserves crash-consistency during repair by making persistent records of the bad pages under recovery. Recovery is idempotent, so it can simply re-execute after a crash.
Crash recovery Pangolin handles recovery from a crash using its redo logs. It must also protect against the possibility that the crash occurred during a parity update.
To commit a transaction, Pangolin first ensures its redo logs are persistent and replicated, sets a persistent logging complete mark, and then it can begin updating the NVMM objects and their parity. If a crash happens before redo logs are complete, on reboot Pangolin discards the redo logs without touching objects and parity. If redo logs exist, Pangolin replays them to update objects and then uses the modified object ranges to update parity of the affected page columns.
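The recovery rule above can be modeled in a few lines. This is a toy sketch, not Pangolin's code: the log format, the `log_complete` flag, and the helper names are hypothetical, and parity is simply recomputed over the replayed ranges, which is one assumed way of bringing the affected columns back in sync.

```python
def xor_columns(rows, lo, hi):
    """XOR the byte range [lo, hi) across all data rows."""
    out = bytearray(hi - lo)
    for row in rows:
        for i in range(lo, hi):
            out[i - lo] ^= row[i]
    return out


def recover(rows, parity, redo_log, log_complete):
    """Toy crash recovery: replay a complete redo log, then fix up parity."""
    if not log_complete:
        # The crash happened before logging finished; objects and parity
        # were never touched, so the log is simply discarded.
        return False
    for row_index, offset, data in redo_log:
        rows[row_index][offset:offset + len(data)] = data  # idempotent replay
        lo, hi = offset, offset + len(data)
        parity[lo:hi] = xor_columns(rows, lo, hi)          # assumed: recompute
    return True


rows = [bytearray(8), bytearray(8)]
parity = bytearray(8)                      # XOR of two all-zero rows
redo_log = [(0, 2, b"hi"), (1, 2, b"yo")]  # (row, offset, new bytes)

discarded = recover(rows, parity, redo_log, log_complete=False)
untouched = bytes(rows[0])
replayed = recover(rows, parity, redo_log, log_complete=True)
replayed_again = recover(rows, parity, redo_log, log_complete=True)
```

Replaying twice leaves the same state as replaying once, which is why a crash during recovery itself is harmless in this model.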
Pangolin does not log parity updates because it would double the cost of logging. This does raise the possibility of data loss if a crash occurs during a parity update and a media error then corrupts data of the same page column before recovery can complete. This scenario requires the simultaneous loss of two pages in the same page column due to corruption and a crash, which we expect to be extremely rare.
In this section, we evaluate Pangolin’s performance and the overheads it incurs by comparing it to normal libpmemobj and replicated libpmemobj. We start with our experimental setup and then consider its storage requirements, latency impact, scalability, application-level performance, and corruption recovery.
4.1 Evaluation Setup
Since NVMM devices are not commercially available yet, we use DRAM-emulated NVMM for our evaluation. Our evaluation machine has a Xeon CPU E3-1270 v6 with 8 cores and 32 GB main memory. We use Ubuntu 16.04 with Linux 4.13, and configure a 16 GB emulated persistent memory device. We run NOVA as the NVMM file system, and applications use mmap() to access an NVMM-resident file.
Table 2: Operation modes for the evaluation.
|Pmemobj||libpmemobj baseline from PMDK v1.5|
|Pangolin||Pangolin baseline w/ micro-buffering only|
|Pangolin-ML||Pangolin + metadata and redo log replication|
|Pangolin-MLP||Pangolin-ML + object parity|
|Pangolin-MLPC||Pangolin-MLP + object checksums|
|Pmemobj-R||libpmemobj w/ one replication in another file|
The CPU provides the CLFLUSHOPT instruction for flushing modified cache lines to NVMM’s persistence domain, and the SFENCE instruction to ensure memory ordering. It also has atomic XOR and AVX instructions that our parity and checksum computations use.
Table 2 describes the operation modes for our evaluations. The Pangolin baseline implements transactions with micro-buffering. It uses buffer canaries to prevent corruption from affecting NVMM, but it does not have parity or checksums for in-NVMM data.
We evaluate versions of Pangolin that incrementally add metadata and log replication (“+ML”), object parity (“+MLP”), and checksums (“+MLPC”). Since metadata updates are small and cheap in our evaluation, we combine their impact with that of log replication.
Pmemobj-R is the replication mode of libpmemobj, which mirrors updates to a replica pool during transaction commit. Comparing Pangolin-MLP and Pmemobj-R is especially useful because the two configurations protect against the same types of data corruption: media errors but not scribbles.
4.2 Memory Requirements
We discuss and evaluate Pangolin’s memory requirements for both NVMM and DRAM.
NVMM All our Pangolin experiments use a single 8 GB pool. Pangolin replicates all the pool’s metadata in the same file, which occupies a fixed 8 MB. The rest of the space is for user objects and their protection data. By default, Pangolin uses 100 chunk rows, so the parity occupies 1% of the pool’s capacity. Pmemobj-R uses a second 8 GB file as the replica, doubling the NVMM space requirement.
DRAM Pangolin uses malloc()’d DRAM to construct micro-buffers and automatically recycles them on transaction commit. The required DRAM space is proportional to the sizes of ongoing transactions. Table 3 summarizes the transaction sizes for the evaluated key-value store data structures. In our experiments, micro-buffering never uses more than 5 MB of DRAM.
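The 1% figure follows directly from the row count. The short calculation below is only an illustrative sketch: it assumes the pool is divided into 100 equally sized rows with one row's worth of space holding parity, and it ignores the fixed 8 MB of replicated metadata. The function name is hypothetical and not part of Pangolin.

```python
GIB = 1 << 30
MIB = 1 << 20

def parity_overhead(pool_bytes, chunk_rows):
    """Bytes of parity when one of `chunk_rows` equally sized rows holds parity.

    Assumption for illustration: parity costs exactly 1/chunk_rows of the pool.
    """
    return pool_bytes // chunk_rows

pool = 8 * GIB
parity = parity_overhead(pool, 100)
replica = pool  # Pmemobj-R mirrors the whole pool in a second file

print(f"parity : {parity / MIB:8.1f} MiB ({parity / pool:.1%} of the pool)")
print(f"replica: {replica / MIB:8.1f} MiB ({replica / pool:.0%} of the pool)")
```

Under these assumptions the parity costs roughly one hundredth of what a full replica costs, which is where the space advantage over Pmemobj-R comes from.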
4.3 Transaction Performance
Figure 3 illustrates the transaction latencies for three basic operations on an NVMM object store: object allocation, overwrite, and deallocation. Each transaction operates on one object, and we vary the size of the object.
For allocation, latency grows with object size for all five configurations because of object initialization and cache line write-back latency. Pangolin’s latency is 30% to 40% lower than Pmemobj’s due to its more efficient non-temporal write-backs. An allocation operation does not involve object logging, so Pangolin-ML performs similarly to Pangolin. Pangolin-MLP adds overhead to update the parity data. It outperforms Pmemobj-R by 1.1× to 1.4×.
Adding checksums (Pangolin-MLPC) to small objects (less than 1 KB) incurs only negligible overhead. For a 4 KB object, Pangolin-MLPC adds 10% more latency compared to Pangolin-MLP. Parity’s impact is much larger than checksum’s because updating a parity range demands values from three parts: the micro-buffer, the NVMM object, and the old parity data, while computing a checksum only needs data in a micro-buffer. Moreover, Pangolin needs to flush the modified parity range to persistence, which is the same size as the object. In contrast, writing back a checksum only uses a non-temporal store on a single cache line that holds the checksum value.
Overwriting an NVMM object involves transaction logging for crash consistency. Pangolin and Pmemobj store the same amount of logging data in NVMM, although they use redo logging and undo logging for this purpose, respectively. Since a log entry’s size is proportional to an object’s modified size, which is the whole object in this evaluation, this cost grows with the object. With Pangolin, log replication accounts for about 10% of the latency. Parity updates consume between 20% and 30% of the extra latency, depending on object size, and checksum updates account for between 5% and 10%. Pangolin-MLP’s performance for overwrites is 1.1× to 1.3× better than Pmemobj-R.
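The asymmetry between the two mechanisms can be seen in a small sketch. It is not Pangolin's implementation: the function names are hypothetical, plain Python byte operations stand in for the vectorized and atomic XOR instructions, and `zlib.crc32` is used only as a stand-in checksum. The parity update reads three inputs of object size, while the checksum reads only the micro-buffer.

```python
import zlib

def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """new_parity = old_parity XOR old_data XOR new_data, byte by byte.

    The three inputs mirror the three sources named in the text:
    the old parity range, the NVMM object, and the micro-buffer.
    """
    assert len(old_parity) == len(old_data) == len(new_data)
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

def object_checksum(micro_buffer: bytes) -> int:
    """A checksum needs only the micro-buffer (crc32 is an assumed stand-in)."""
    return zlib.crc32(micro_buffer)

# Two data rows sharing one parity range (toy sizes).
row_a = bytes([1, 2, 3, 4])
row_b = bytes([9, 9, 9, 9])
parity = bytes(a ^ b for a, b in zip(row_a, row_b))

new_a = bytes([7, 7, 3, 4])           # micro-buffer contents at commit
parity = update_parity(parity, row_a, new_a)

# The invariant holds again: parity is the XOR of the current rows.
assert parity == bytes(a ^ b for a, b in zip(new_a, row_b))
print(hex(object_checksum(new_a)))
```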
Deallocation transactions only modify metadata, so their latencies do not change much.
4.4 Scalability
Figure 4 measures Pangolin’s scalability by performing random updates to existing NVMM objects and varying the number of threads.
Pangolin uses reader/writer locks to implement the hybrid parity update scheme described in Section 3.5. The number of rows in a zone and the zone size determine the granularity of these locks: For a fixed zone size, more rows mean fewer columns, fewer locks, and coarser lock granularity.
On our evaluation machine, we measured that updating parity with atomic XORs becomes slower than making a replication when the modified parity range is greater than 512 bytes, so we set 512 bytes as the threshold for switching methods in our hybrid parity updates.
For small updates (64 B and 256 B) there is no lock contention because they use atomic XOR instructions and can execute concurrently. Large updates (1 KB and 4 KB) lock columns at 4 KB page granularity. Our baseline configuration with 1% parity (80 MB) has 20 K pages per row protected by 20 K locks, so the chance of lock contention is slim even with many cores.
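The choice between the two update paths, and the number of column locks, can be sketched as follows. This is an illustration under stated assumptions rather than Pangolin's code: the function names and the `"locked"` label are hypothetical, and the sketch models only which page-granularity column locks a range would need, not the locks themselves.

```python
PAGE = 4096             # lock granularity from the text (4 KB pages)
ATOMIC_XOR_LIMIT = 512  # threshold measured on the evaluation machine

def plan_parity_update(offset_in_row, length):
    """Return (method, page-column indices that must be locked).

    Ranges up to the threshold use atomic XORs and lock no column
    exclusively; larger ranges lock every page column they touch.
    """
    if length <= ATOMIC_XOR_LIMIT:
        return "atomic_xor", []
    first = offset_in_row // PAGE
    last = (offset_in_row + length - 1) // PAGE
    return "locked", list(range(first, last + 1))

def pages_per_row(row_bytes):
    """Number of page columns, and therefore locks, in one row."""
    return row_bytes // PAGE

print(plan_parity_update(4000, 256))    # small update, no column lock
print(plan_parity_update(4000, 1024))   # straddles two page columns
print(pages_per_row(80 * 2**20))        # about 20 K columns for an 80 MB row
```

With on the order of 20 K independent column locks, two random large updates rarely map to the same column, which matches the low contention reported above.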
The graphs also show how each of Pangolin’s fault-tolerance mechanisms affects performance. For small objects (less than 1 KB), Pangolin’s throughput is 1.2× to 1.3× that of Pmemobj. Pangolin-MLP also outperforms Pmemobj-R by a similar amount, and Pangolin-MLPC performs only marginally worse than Pangolin-MLP. For 4 KB objects, Pangolin and Pangolin-MLPC perform close to Pmemobj and Pmemobj-R, respectively.
Scaling degrades for all configurations as update size and thread count grow because available concurrency or memory bandwidth becomes saturated.
4.5 Impacts on NVMM Applications
To evaluate Pangolin in more complex applications, we use six data structures included in the PMDK toolkit: crit-bit tree (ctree), red-black tree (rbtree), btree, skiplist, radix tree (rtree), and hashmap. They have a wide range of object sizes and use a diverse set of algorithms to insert, remove, and look up values. We rewrote these benchmarks with Pangolin’s programming interface as described in Section 3.4.
Table 3 summarizes the object and transaction sizes for each workload. The tree structures and the skiplist have a single type of object which is the tree or list node. The hashmap has two kinds of objects. One is the hash table that contains pointers to buckets. The hash table grows as the application inserts more key-value pairs. Each bucket is a linked list of fixed-sized entry objects.
Each insertion and removal is a transaction processing a key-value pair, and they involve a mix of object allocations, overwrites, and deallocations. The table also shows the average transaction sizes for each operation. Deallocated sizes are not shown because they marginally affect the performance differences (see Figure 3).
The average modified sizes (“Mod” rows in the table) determine the redo log size and affect the performance drop between Pangolin and Pangolin-ML. An average allocation size (“New” rows) smaller than the object size means the data structure does not allocate a new object for every insert operation (e.g., btree). The performance difference between Pangolin-ML and Pangolin-MLP is a consequence of both allocated and modified sizes.
Pangolin is faster than Pmemobj for ctree, rbtree, btree, and hashmap, but slower than Pmemobj for skiplist and rtree. This is because the latter two have relatively larger objects, and Pangolin pays the cost of copying their data into micro-buffers. On average, Pangolin is 1.15× faster than Pmemobj and prevents NVMM corruption with micro-buffering.
Pangolin-MLP’s performance is 98% of Pmemobj-R’s on average, and it saves orders of magnitude of NVMM space by using parity data as redundancy. Pangolin-MLPC adds scribble detection, and performance drops by 1% to 25% relative to Pangolin-MLP. Adding object checksums impacts rtree’s insert the most because its allocated object size is large, which inevitably requires more checksum computing time.
Pangolin does not impact the lookup performance because it performs direct NVMM reads without constantly verifying object checksums. Pangolin ensures data integrity with its checksum verification policy as discussed in Section 3.3. Figure 6 illustrates the impact of transaction count-based scrubbing on key-value inserts. It depends on the number of user objects in the pool, object sizes, and verification frequency. Applications can adjust verification frequencies according to practical needs for performance and data integrity assurance.
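Transaction count-based scrubbing can be pictured with a toy model. The sketch below is not Pangolin's API: the `Pool` class, its fields, and the use of `zlib.crc32` as the checksum are all assumptions made for illustration. Every `scrub_every` commits, the pool re-verifies the stored checksum of every object, so the cost grows with the number and size of objects and with the verification frequency.

```python
import zlib

class Pool:
    """Toy object store mapping an object id to (data, stored checksum)."""

    def __init__(self, scrub_every):
        self.objects = {}
        self.scrub_every = scrub_every  # verify all objects every N commits
        self.committed = 0

    def commit(self, oid, data):
        """Store an object; return ids found corrupt if this commit scrubs."""
        self.objects[oid] = (data, zlib.crc32(data))
        self.committed += 1
        if self.committed % self.scrub_every == 0:
            return self.scrub()
        return []

    def scrub(self):
        """Return ids whose data no longer matches the stored checksum."""
        return [oid for oid, (data, csum) in self.objects.items()
                if zlib.crc32(data) != csum]

pool = Pool(scrub_every=3)
pool.commit("a", b"alpha")
pool.commit("b", b"beta")

# Emulate a scribble: the data changes behind the library's back.
pool.objects["a"] = (b"alphx", pool.objects["a"][1])

corrupt = pool.commit("c", b"gamma")  # third commit triggers a scrub
print(corrupt)
```

A smaller `scrub_every` shortens the window in which corruption can go unnoticed at the price of more verification work per transaction.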
4.6 Error Detection and Correction
Pangolin provides error injection functions to emulate both hardware-uncorrectable NVMM media errors and hardware-undetectable scribbles.
Since our test system does not support NVMM error injection, we use mprotect() and SIGSEGV to emulate NVMM media errors and SIGBUS. When an NVMM file is DAX-mapped, the injector can choose a page that contains application objects, erase it, and call mprotect(PROT_NONE) on the page. Later when the application reads the corrupted object, Pangolin intercepts SIGSEGV, changes the page to read/write mode, and restores the page’s data. The injector function can also scribble an object pool’s metadata or object data in a targeted way.
In both test cases, we observe Pangolin can successfully repair a victim page or an object and resume normal program execution.
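The page repair step amounts to an XOR over the surviving pages of the same page column. The sketch below illustrates that idea only: the helper names are hypothetical, pages are shortened to 8 bytes, and signal handling and mprotect() are left out entirely.

```python
from functools import reduce

def xor_pages(pages):
    """Byte-wise XOR of equally sized pages."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*pages))

def rebuild_page(surviving_pages, parity_page):
    """Recover the single lost page of a column from the rest plus parity."""
    return xor_pages(list(surviving_pages) + [parity_page])

# The same page column in three data rows (toy 8-byte "pages").
column = [bytes([value] * 8) for value in (3, 5, 9)]
parity = xor_pages(column)

lost = column[1]                      # the page the injector erased
survivors = [column[0], column[2]]
assert rebuild_page(survivors, parity) == lost
```

Because there is one parity page per column, this recovers at most one lost page per column, which is why losing two pages of the same column is the unrecoverable case.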
We intentionally introduced buffer overflow bugs in our applications and observed that Pangolin can successfully detect them using micro-buffer canaries. The transaction then aborts to prevent any NVMM corruption.
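A canary check of this kind can be sketched as follows. The `MicroBuffer` class, its methods, and the fixed canary value are hypothetical and chosen to keep the example deterministic; the sketch places a canary only after the object and omits everything else a real micro-buffer would carry.

```python
CANARY = b"\xde\xad\xbe\xef" * 2  # fixed value for a deterministic example

class MicroBuffer:
    """DRAM copy of an object followed by a canary word."""

    def __init__(self, obj: bytes):
        self.size = len(obj)
        self.mem = bytearray(obj) + bytearray(CANARY)

    def write(self, offset, data):
        # Deliberately unchecked, like a raw memcpy through a pointer.
        self.mem[offset:offset + len(data)] = data

    def commit(self):
        """Return the object bytes, or abort if the canary was overwritten."""
        if bytes(self.mem[self.size:self.size + len(CANARY)]) != CANARY:
            raise RuntimeError("canary overwritten: aborting transaction")
        return bytes(self.mem[:self.size])

ok = MicroBuffer(bytes(16))
ok.write(0, b"A" * 16)                # stays inside the object
assert ok.commit() == b"A" * 16

bad = MicroBuffer(bytes(16))
bad.write(12, b"B" * 8)               # runs 4 bytes past the object
try:
    bad.commit()
    aborted = False
except RuntimeError:
    aborted = True
print("overflow detected:", aborted)
```

Since the overflow lands in DRAM and is caught before write-back, the NVMM copy of the object is never touched by the buggy write.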
Pangolin also implements a debug mode for crash recovery, which derives an object’s modified ranges by contrasting its micro-buffer and NVMM versions, and then checks whether the user-specified ranges match. If the derived ranges are greater, the program has failed to declare all the ranges it actually modifies, which can compromise crash recovery. We found five such bugs when porting PMDK’s examples to Pangolin; all of the fixes were accepted into PMDK’s repository as pull requests.
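The range check in this debug mode can be approximated by a byte-wise diff. The following is a minimal sketch with hypothetical function names, assuming the micro-buffer and the NVMM copy have equal length. Note that a diff cannot see a write that stores the value already present, so it can only under-report modifications.

```python
def modified_ranges(nvmm: bytes, buf: bytes):
    """Return [start, end) ranges where the micro-buffer differs from NVMM."""
    assert len(nvmm) == len(buf)
    ranges, start = [], None
    for i, (old, new) in enumerate(zip(nvmm, buf)):
        if old != new and start is None:
            start = i                      # a differing run begins
        elif old == new and start is not None:
            ranges.append((start, i))      # the run ended at i
            start = None
    if start is not None:
        ranges.append((start, len(buf)))
    return ranges

def undeclared(derived, declared):
    """Derived ranges not fully covered by the user-declared ranges."""
    covered = set()
    for s, e in declared:
        covered.update(range(s, e))
    return [(s, e) for s, e in derived if not covered.issuperset(range(s, e))]

nvmm = bytes(16)
buf = bytearray(nvmm)
buf[2:4] = b"\x01\x01"    # declared by the program
buf[10] = 7               # modified but never declared

derived = modified_ranges(nvmm, bytes(buf))
print(derived)
print(undeclared(derived, declared=[(2, 4)]))
```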
5 Related Work
In this section, we place Pangolin in context relative to previous projects that have explored how to use NVMM effectively.
Transaction Support All previous libraries for using NVMMs to build complex objects rely on transactions for crash consistency. Although we built Pangolin on libpmemobj, its techniques could be applied to any persistent object system. NV-Heaps, Atlas, DCT, and libpmemobj provide undo logging for applications to snapshot persistent objects before making in-place updates. Mnemosyne, SoftWrAP, and DudeTM use variations of redo logging. REWIND implements both undo and redo logging for fine-grained, highly concurrent transactions. Log-structured NVMM makes changes to objects via append-only logs, and it does not require extra logging for consistency. Romulus uses a main-back mechanism to implement efficient redo log-based transactions.
None of these systems provide fault tolerance for NVMM errors. We believe they can adopt Pangolin’s parity and checksum design to improve their resilience to NVMM errors at low storage overhead. In Section 3.5 we have described how to apply the hybrid parity updating scheme to an undo logging-based system. Log-structured and copy-on-write systems can adopt the techniques in similar ways.
Fault Tolerance Both Pangolin and libpmemobj’s replication mode protect against media errors, but Pangolin provides stronger protection and much lower space overheads. Furthermore, libpmemobj can only repair media errors offline, and it does not detect or repair any software corruption to user objects.
NVMalloc uses checksums to protect metadata. It does not specify whether application data is also checksum-protected, and it does not provide any form of redundancy to repair the corruption. NVMalloc uses mprotect() to protect NVMM pages while they are not mapped for writing. Pangolin could adopt this memory protection scheme to prevent an application from scribbling its own persistent data structures.
The NOVA file system [43, 44] uses parity-based protection for file data. However, it must disable these features for NVMM pages that are DAX-mapped for writing in user space, since the page’s contents can change without the file system’s knowledge, making it impossible for NOVA to keep the parity information consistent if an application modifies DAX-mapped data. As a result, Pangolin is complementary: It provides fault tolerance when NOVA cannot.
6 Conclusion
This work presents Pangolin, a fault-tolerant, DAX-mapped NVMM programming library for applications to build complex data structures in NVMM. Pangolin uses a novel, space-efficient layout of data and parity to protect arbitrary-sized NVMM objects combined with per-object checksums to detect corruption. To maintain high performance, Pangolin uses micro-buffering, carefully chosen parity and checksum updating algorithms, and efficient implementations. As a result, Pangolin provides stronger protection, better availability, and vastly lower storage overheads than existing NVMM programming libraries.
-  Dhruva R. Chakrabarti, Hans-J. Boehm, and Kumud Bhandari. Atlas: Leveraging Locks for Non-volatile Memory Consistency. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA’14, pages 433–452. ACM, 2014.
-  Andreas Chatzistergiou, Marcelo Cintra, and Stratis D Viglas. REWIND: Recovery Write-ahead System for In-Memory Non-volatile Data-Structures. Proceedings of the VLDB Endowment, 8:497–508, 2015.
-  Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: Making Persistent Objects Fast and Safe with Next-generation, Non-volatile Memories. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’11, pages 105–118. ACM, 2011.
-  Jeremy Condit, Edmund B Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. Better I/O through Byte-addressable, Persistent Memory. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating systems principles, SOSP’09, pages 133–146. ACM, 2009.
-  Andreia Correia, Pascal Felber, and Pedro Ramalhete. Romulus: Efficient Algorithms for Persistent Transactional Memory. In Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures, SPAA’18, pages 271–282. ACM, 2018.
-  Dan Williams. libnvdimm for 4.12, 2017. https://lkml.org/lkml/2017/5/5/620.
-  Dan Williams. libnvdimm for 4.13, 2017. https://lkml.org/lkml/2017/7/6/843.
-  Dan Williams. use memcpy_mcsafe() for copy_to_iter(), 2018. https://lkml.org/lkml/2018/5/1/708.
-  Joel E. Denny, Seyong Lee, and Jeffrey S. Vetter. NVL-C: Static Analysis Techniques for Efficient, Correct Programming of Non-Volatile Main Memory Systems. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC’16, pages 125–136. ACM, 2016.
-  Subramanya R Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. System Software for Persistent Memory. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys’14, page 15. ACM, 2014.
-  Ellis R Giles, Kshitij Doshi, and Peter Varman. SoftWrAP: A Lightweight Framework for Transactional Support of Storage Class Memory. In Mass Storage Systems and Technologies (MSST), 2015 31st Symposium on, MSST’16, pages 1–14. IEEE, 2015.
-  Qingda Hu, Jinglei Ren, Anirudh Badam, Jiwu Shu, and Thomas Moscibroda. Log-Structured Non-Volatile Main Memory. In Proceedings of the USENIX Annual Technical Conference, ATC’17, pages 703–717. USENIX Association, 2017.
-  Andy A. Hwang, Ioan A. Stefanovici, and Bianca Schroeder. Cosmic Rays Don’t Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’12, pages 111–122. ACM, 2012.
-  Intel. Intel Architecture Instruction Set Extensions Programming Reference, 2017. https://software.intel.com/en-us/isa-extensions.
-  Introduction to Programming with Persistent Memory from Intel,
-  Intel. Persistent Memory Programming - Frequently Asked Questions, 2017. https://software.intel.com/en-us/articles/persistent-memory-programming-frequently-asked-questions.
-  Intel. Intelligent Storage Acceleration Library, 2018. https://software.intel.com/en-us/storage/isa-l.
-  Joseph Izraelevitz, Terence Kelly, and Aasheesh Kolli. Failure-Atomic Persistent Memory Updates via JUSTDO Logging. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’16, pages 427–442. ACM, 2016.
-  Aasheesh Kolli, Steven Pelley, Ali Saidi, Peter M. Chen, and Thomas F. Wenisch. High-Performance Transactions for Persistent Memories. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’16, pages 399–411. ACM, 2016.
-  Harendra Kumar, Yuvraj Patel, Ram Kesavan, and Sumith Makam. High Performance Metadata Integrity Protection in the WAFL Copy-on-Write File System. In 15th USENIX Conference on File and Storage Technologies, FAST’17, pages 197–212. USENIX Association, 2017.
-  Xiaozhou Li, David G. Andersen, Michael Kaminsky, and Michael J. Freedman. Algorithmic Improvements for Fast Concurrent Cuckoo Hashing. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys’14, pages 27:1–27:14. ACM, 2014.
-  Mengxing Liu, Mingxing Zhang, Kang Chen, Xuehai Qian, Yongwei Wu, Weimin Zheng, and Jinglei Ren. DudeTM: Building Durable Transactions with Decoupling for Persistent Memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’17, pages 329–343. ACM, 2017.
-  Tony Luck. Patchwork mm/hwpoison: Clear PRESENT Bit for Kernel 1:1 Mappings of Poison Pages, 2017. https://patchwork.kernel.org/patch/9793701.
-  Amirsaman Memaripour, Anirudh Badam, Amar Phanishayee, Yanqi Zhou, Ramnatthan Alagappan, Karin Strauss, and Steven Swanson. Atomic In-place Updates for Non-volatile Main Memories with Kamino-Tx. In Proceedings of the Twelfth European Conference on Computer Systems, EuroSys’17, pages 499–512. ACM, 2017.
-  Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu. A Large-Scale Study of Flash Memory Failures in the Field. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS’15, pages 177–190. ACM, 2015.
-  Micron. 3D XPoint Technology, 2017. http://www.micron.com/products/advanced-solutions/3d-xpoint-technology.
-  Micron. Hybrid Memory: Bridging the Gap Between DRAM Speed and NAND Nonvolatility, 2017. http://www.micron.com/products/dram-modules/nvdimm.
-  Iulian Moraru, David G Andersen, Michael Kaminsky, Niraj Tolia, Parthasarathy Ranganathan, and Nathan Binkert. Consistent, Durable, and Safe Memory Management for Byte-addressable Non Volatile Main Memory. In Proceedings of the First ACM SIGOPS Conference on Timely Results in Operating Systems, TRIOS’13. ACM, 2013.
-  Sanketh Nalli, Swapnil Haria, Mark D Hill, Michael M Swift, Haris Volos, and Kimberly Keeton. An Analysis of Persistent Memory Use with WHISPER. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’17, pages 135–148. ACM, 2017.
-  Dushyanth Narayanan and Orion Hodson. Whole-system Persistence. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’12, pages 401–410. ACM, 2012.
-  Iyswarya Narayanan, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, and Kushagra Vaid. SSD Failures in Datacenters: What? When? And Why? In Proceedings of the 9th ACM International on Systems and Storage Conference, SYSTOR’16, pages 7:1–7:11. ACM, 2016.
-  Matheus Almeida Ogleari, Ethan L Miller, and Jishen Zhao. Steal but No Force: Efficient Hardware Undo+Redo Logging for Persistent Memory Systems. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on, HPCA’18, pages 336–349. IEEE, 2018.
-  Christian Perone and David Murray. pynvm: Non-volatile memory for Python, 2017. https://github.com/pmem/pynvm.
-  pmem.io. How to emulate persistent memory, 2016. https://pmem.io/2016/02/22/pm-emulation.html.
-  pmem.io. Persistent Memory Development Kit, 2017. http://pmem.io/pmdk.
-  Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei Wu, and Onur Mutlu. ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems. In Proceedings of the 48th International Symposium on Microarchitecture, MICRO’15, pages 672–685. IEEE, 2015.
-  Bianca Schroeder, Sotirios Damouras, and Phillipa Gill. Understanding Latent Sector Errors and How to Protect Against Them. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST’10. USENIX Association, 2010.
-  Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. DRAM Errors in the Wild: A Large-scale Field Study. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, SIGMETRICS’09, pages 193–204. ACM, 2009.
-  Jihye Seo, Wook-Hee Kim, Woongki Baek, Beomseok Nam, and Sam H Noh. Failure-Atomic Slotted Paging for Persistent Memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’17, pages 91–104. ACM, 2017.
-  Vilas Sridharan, Nathan DeBardeleben, Sean Blanchard, Kurt B. Ferreira, Jon Stearley, John Shalf, and Sudhanva Gurumurthi. Memory Errors in Modern Systems: The Good, The Bad, and The Ugly. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’15, pages 297–310. ACM, 2015.
-  Advanced Configuration and Power Interface Specification, 2017.
-  Haris Volos, Andres Jaan Tack, and Michael M. Swift. Mnemosyne: Lightweight Persistent Memory. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’11, pages 91–104. ACM, 2011.
-  Jian Xu and Steven Swanson. NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories. In Proceedings of the 14th USENIX Conference on File and Storage Technologies, FAST’16, pages 323–338. USENIX Association, 2016.
-  Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Steven Swanson, and Andy Rudoff. NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP’17, pages 478–496. ACM, 2017.
-  Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong, and Bingsheng He. NV-Tree: Reducing Consistency Cost for NVM-based Single Level Systems. In Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST’15, pages 167–181. USENIX Association, 2015.
-  Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. End-to-end Data Integrity for File Systems: A ZFS Case Study. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST’10, pages 3–3. USENIX Association, 2010.