A Generic Checkpoint-Restart Mechanism for Virtual Machines
It is common today to deploy complex software inside a virtual machine (VM). Snapshots provide rapid deployment, migration between hosts, dependability (fault tolerance), and security (insulating a guest VM from the host). Yet, for each virtual machine, the code for snapshots is laboriously developed on a per-VM basis. This work demonstrates a generic checkpoint-restart mechanism for virtual machines. The mechanism is based on a plugin on top of an unmodified user-space checkpoint-restart package, DMTCP. Checkpoint-restart is demonstrated for three virtual machines: Lguest, user-space QEMU, and KVM/QEMU. The plugins for Lguest and KVM/QEMU require just 200 lines of code. The Lguest kernel driver API is augmented by 40 lines of code. DMTCP checkpoints user-space QEMU without any new code. KVM/QEMU, user-space QEMU, and DMTCP need no modification. The design benefits from other DMTCP features and plugins. Experiments demonstrate checkpoint and restart in 0.2 seconds using forked checkpointing, mmap-based fast-restart, and incremental Btrfs-based snapshots.
A generic mechanism is presented for checkpointing virtual machines. Snapshots of virtual machines are a key technology for dependable computing. They are more important today than ever for deployment in clouds (including IaaS, “Infrastructure as a Service”), rapid deployment (starting from an initial snapshot), migration between hosts, fault tolerance for dependability, and greater security by insulating a guest VM from the host. Current virtual machines rely on machine-specific checkpoint-restart mechanisms. Such designs struggle with common checkpointing issues: live checkpointing without stopping the virtual machine, incremental checkpointing, differential checkpointing, forked checkpointing (checkpointing within a forked child process concurrently with execution within the parent), and checkpointing of distributed virtual machines.
By employing a standard checkpoint-restart package, the virtual machine directly inherits all of the features of that checkpoint-restart package. A further key difference of the new approach is that the checkpoint-restart package operates externally to the virtual machine. Because it is not embedded inside the guest virtual machine or the hypervisor, there is greater flexibility for it to interact as a standard process within the hypervisor or host operating system.
As an example, a desktop user can now gain very high reliability by running within a virtual machine that is set to take a snapshot every minute. Section 5 demonstrates that the DMTCP features of forked checkpointing and mmap-based fast restart enable a virtual machine snapshot in 0.2 seconds, when running with the Btrfs filesystem. That section also shows a run-time overhead that is too small to be measured when running the nbench2 benchmark program. Btrfs is expected to become the default filesystem for Fedora, Ubuntu, and others in about a year.
The generic mechanism of this work is based on the DMTCP checkpoint-restart package [AAC09]. The mechanism is demonstrated on three types of virtual machines: KVM/QEMU [KVM12], user-space (standalone) QEMU [Qem12], and Lguest [Rus12]. In all three cases, the hypervisor (VMM, or virtual machine monitor) is based on Linux as the host operating system. The three examples cover three distinct situations: entirely user-space virtualization (QEMU), full virtualization using a Linux kernel driver (KVM/QEMU), and paravirtualization using a Linux kernel driver (Lguest). (A paravirtualized virtual machine is a virtual machine that requires modifications to the guest operating system.)
By providing checkpoint-restart capability to existing virtual machines, one can retroactively add snapshot capability to a virtual machine in a mostly transparent manner. DMTCP already checkpoints user-space QEMU “out of the box”. An additional DMTCP-based plugin is required for KVM/QEMU, but neither KVM/QEMU nor DMTCP is modified. In the case of Lguest, the Lguest kernel driver requires about 40 lines of new code to support the checkpoint-restart capability.
The additional code required to checkpoint a new virtual machine is approximately 200 lines (for plugins in the case of KVM/QEMU and Lguest). Since user-space QEMU has no kernel driver component, DMTCP is able to directly checkpoint it.
Based on our experience, we estimate that someone familiar with the examples provided here could implement checkpoint-restart for a new virtual machine in approximately five person-days, assuming that the VM provides a kernel driver API, as is the case for KVM. (KVM is the kernel driver component of a KVM/QEMU virtual machine.) Where no kernel driver API is provided, the development time is estimated at ten days, due to the need to understand the VM kernel driver internals and to augment the existing API between driver kernel space and user space.
The two virtual machines above (KVM/QEMU and Lguest) require the estimated effort primarily due to the need to save state within the kernel driver, and then to appropriately restore and patch the state within the kernel driver at the time of restart.
Surprisingly, DMTCP was able to checkpoint user-space QEMU directly, with no requirements for new code or new plugins. In hindsight, this is attributed to the fact that user-space QEMU has no kernel driver, and so no communication between kernel space and user-space. In experiments, DMTCP and QEMU were used to checkpoint both the Linux and Windows guest operating systems “out of the box”, with no additional modifications.
For all three virtual machines, DMTCP [AAC09, DMT12] is used for purposes of checkpointing. DMTCP is a widely used user-space transparent checkpoint-restart package. DMTCP was chosen in part for the sake of its support for third-party plugins (see Section 2).
Of the three virtual machines on which generic checkpoint-restart is demonstrated, to the best of our knowledge Lguest has not previously been checkpointed. QEMU provides a “savevm” command for directly checkpointing. KVM/QEMU has been previously checkpointed by modifying existing features within KVM [SC11, CLO08] and making use of the QEMU savevm command.
By checkpointing using an external, generic checkpoint-restart package, one immediately inherits DMTCP’s ability to take a consistent distributed snapshot. This is useful for analyzing any type of distributed computation. Furthermore, there is the opportunity to easily extend this work in such future directions as:
fork-based checkpointing (quiesce the VM process, and fork a child VM process to be checkpointed, while the parent continues to execute, using the copy-on-write semantics of fork);
heterogeneous checkpointing (checkpointing different virtual machines and “bare” processes running a distributed computation); and
incremental and differential checkpointing (checkpointing only that part of RAM that has changed since the last checkpoint).
DMTCP has already been used for fault-tolerant applications, while demonstrating each of the above features. Thus, checkpoint-restart of virtual machines can be extended to take advantage of such features. At the same time, the checkpoint-restart capability remains largely orthogonal to the ongoing internal development of the virtual machine packages.
Snapshots (including filesystem).
A distinction is sometimes made between checkpoints and snapshots when the terminology is applied to virtual machines. A checkpoint is a copy of the state of a virtual machine suitable for being restored. Such a checkpoint may or may not include saving a copy of the filesystem. A snapshot always includes a full copy of the filesystem.
In a snapshot, rather than copy the entire filesystem during each checkpoint, one prefers to use a filesystem supporting copy-on-write in order to take a snapshot. Filesystems that support copy-on-write usually also support incremental snapshots.
This work uses Btrfs, a stable filesystem that is likely to be readily available in most future Linux distributions. Btrfs has been in the mainline Linux kernel since 2009 (since Linux 2.6.29). Both Fedora and Ubuntu are planning for Btrfs to be the default filesystem in late 2013 or later.
The experimental section uses copy-on-write incremental snapshots, based on Btrfs [RBM12], for most experiments. The time to take a snapshot of the guest filesystem tends to be too small to measure as part of the total restart time.
Forked checkpointing and fast restart:
The checkpoint and restart can be sped up through standard features of DMTCP. At checkpoint time, forked checkpointing is employed. The guest VM (viewed as a process in the host) forks itself, and the child process is checkpointed. Fast restart uses mmap to map the checkpoint image into RAM. This allows the memory pages to be demand paged in as needed. Forked checkpointing reduces the delay for a checkpoint to approximately 0.2 seconds (while the child process continues to write out the checkpoint image), and the fast restart time is about 0.1 seconds.
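As an illustration only, the idea behind forked checkpointing can be sketched in a few lines of Python. This is not DMTCP's implementation. It assumes a POSIX host where os.fork is available, and the function name forked_checkpoint and the JSON image format are hypothetical choices made for the sketch.

```python
import json
import os
import tempfile


def forked_checkpoint(state, image_path):
    """Fork; the child writes `state` to `image_path` while the parent returns at once.

    Returns the child's pid so that the caller can later wait for the image.
    """
    pid = os.fork()
    if pid == 0:
        # Child: owns a copy-on-write view of memory as it was at the fork.
        status = 1
        try:
            with open(image_path, "w") as f:
                json.dump(state, f)
            status = 0
        finally:
            os._exit(status)  # never return into the parent's code path
    return pid


if __name__ == "__main__":
    guest = {"ram": [1, 2, 3]}
    image = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    child = forked_checkpoint(guest, image)
    guest["ram"].append(4)        # the parent keeps computing right away
    os.waitpid(child, 0)          # wait only to inspect the finished image
    with open(image) as f:
        print(json.load(f))       # state as of the fork, without the 4
```

The parent's delay is only the cost of the fork itself, which is why the checkpoint delay seen by the guest is largely independent of the image size.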
In the rest of this paper, Section 2 provides background on DMTCP plugins. Section 3 describes the generic mechanism for checkpoint-restart of virtual machines. Section 4 describes several challenges in the implementation, in order to provide deeper insights into the issues in implementing the generic mechanism. Finally, Section 5 provides experimental results, Section 6 describes related work, and Section 7 provides the conclusion.
2 DMTCP Plugins: Background
DMTCP (Distributed MultiThreaded CheckPointing) [AAC09] is used to checkpoint and restart a virtual machine. The current version of DMTCP (DMTCP-1.2.6) [DMT12] provides a facility for third-party plugins. The work described here was based on DMTCP svn revision 1755.
When a new virtual machine is launched (e.g. QEMU), the user prefixes the launch command with dmtcp_checkpoint, as in the command below. A checkpoint image can then be created, and the virtual machine can later be restarted from that image via dmtcp_restart:
dmtcp_checkpoint --with-plugin dmtcp_VM_plugin.so qemu ...
In the above scenario, VM would be KVM or LGUEST. The plugin dmtcp_VM_plugin.so is the additional code developed for this work.
Plugins allow the functionality of DMTCP to be extended without modification to the underlying DMTCP binary. For the purposes of this work, we use two essential features of DMTCP plugins.
Wrapper functions: DMTCP provides wrapper functions around calls to library functions. In particular, it supports wrappers around system calls.
DMTCP event handling: DMTCP notifies the plugins of several events. DMTCP blocks while plugins process events. The most important events for our purposes are pre-checkpoint and post-restart.
A wrapper function is a function that is interposed between a caller and a callee. If the base code calls a function foo, and if a wrapper function bar is interposed between the base code and foo, then the base code calls bar instead. DMTCP provides a mechanism for plugins to transparently insert such wrapper functions around any library call, including system calls. In typical usage, the wrapper function will then call the interposed function (although possibly with modified arguments), and then pass back a (possibly modified) copy of the return value of the interposed function. For a review of the many techniques for interposition, see [TL01].
The VM kernel driver may interpose its own wrapper functions around system calls that refer to a device supported by the virtual machine. (Traditional kernel terminology does not call this a wrapper function.) For example, KVM creates wrappers for the device /dev/kvm, and Lguest creates wrappers for the device /dev/lguest. Thus, the DMTCP plugin is effectively creating its own wrapper function around a VM-supplied wrapper, which in turn delegates to the kernel for the standard functionality.
Next, we discuss DMTCP events. During a pre-checkpoint event, all user threads have been quiesced, and DMTCP has not yet begun to save the process state (including the state of memory). During a post-restart event, DMTCP has finished restoring process state (including all of memory), but control has not yet been returned to the user threads.
This design allows the plugin to save additional state relevant to the virtual machine at the time of checkpoint. During the post-restart event, the checkpoint-restart appears transparent to the plugin. Hence, the plugin finds the VM state in whatever data structure the plugin had originally used to save the information.
The plugin uses a virtual-machine-specific method to transfer data between the kernel driver of the virtual machine and the user-space memory where the plugin “lives”. Sections 3.4 and 3.5 describe those VM-specific mechanisms.
A typical use of a DMTCP plugin is to use a wrapper function to record information about certain system calls issued by the launcher. This allows the DMTCP plugin to execute modified versions of those same system calls at the time of restart, before the thread of control is handed back to the virtual machine.
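The record-and-replay use of wrapper functions described above can be illustrated with a minimal Python sketch. It is an analogy only: DMTCP interposes on C library calls, whereas this sketch wraps ordinary Python functions. The names record_calls and replay are hypothetical and are not part of DMTCP.

```python
import functools


def record_calls(log):
    """Return a decorator that interposes a wrapper around a function.

    The wrapper calls the interposed function, appends
    (name, args, kwargs, result) to `log`, and passes the result back.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            log.append((func.__name__, args, kwargs, result))
            return result
        return wrapper
    return decorator


def replay(log, targets, rewrite=lambda name, args, kwargs: (args, kwargs)):
    """Re-issue the recorded calls against `targets`, optionally rewriting arguments."""
    results = []
    for name, args, kwargs, _old_result in log:
        new_args, new_kwargs = rewrite(name, args, kwargs)
        results.append(targets[name](*new_args, **new_kwargs))
    return results


if __name__ == "__main__":
    calls = []

    @record_calls(calls)
    def set_memory_region(slot, size):   # hypothetical launcher call
        return (slot, size)

    set_memory_region(0, 128)
    # At "restart", replay the same call against a fresh target.
    print(replay(calls, {"set_memory_region": lambda slot, size: (slot, size)}))
```

The rewrite hook stands in for the "modified versions" of the calls: it lets the replay adjust arguments that are no longer valid after restart.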
3 Generic Mechanism for Checkpoint-Restart
This section describes the general mechanism for checkpointing and restarting a virtual machine. Section 3.1 provides an overview of the actions of our DMTCP plugin in supporting checkpoint-restart. Section 3.2 describes a generic sequence of steps that any virtual machine must employ in launching a new virtual machine. Section 3.3 then describes the steps needed to restore and restart that virtual machine. Finally, the generic mechanism depends on the APIs provided by the virtual machine (or augmented APIs in the case of Lguest). Sections 3.4 and 3.5 describe those APIs for KVM and Lguest, respectively. The APIs are responsible for saving the VM driver state, and later for launching a shell VM, and then restoring the VM driver state.
The existing DMTCP package already transparently checkpoints and restores all of user-space memory, along with essentially all pertinent process state (threads, open file descriptors, associated terminal device, stdin/stdout/stderr, sockets, shared memory regions, etc.). Where a subsystem refers to an external object, DMTCP has several subsystem-specific heuristics for restoring such information. Examples of such cases abound: open files that were modified or re-named after checkpoint; sockets to database servers; shared memory regions with daemons such as NSCD; etc.
In the case of user-space QEMU, the existing DMTCP package and its heuristics for restoring subsystems sufficed to correctly checkpoint and restart QEMU. No DMTCP plugin was required. This was tested with QEMU running each of Linux and Microsoft Windows. (See Section 5.) The rest of this section is concerned with KVM/QEMU and Lguest, for which a DMTCP plugin was required.
3.1 Overview of Actions of DMTCP Plugin
For the KVM/QEMU and Lguest virtual machines, a DMTCP plugin was implemented to save and restore state contained in the VM kernel driver. Recall from Section 2 that the two features of DMTCP plugins we use are wrapper functions and notification of the pre-checkpoint/post-restart events. Wrapper functions are used to record information about system calls sent by the VM launcher to the kernel. At the time of pre-checkpoint, the plugin saves certain state within the VM kernel driver. Since that information is contained in the plugin’s user-space memory, DMTCP automatically saves it at checkpoint time and later restores it, as part of DMTCP’s standard procedure for saving and restoring all of user-space memory. At the time of post-restart, the plugin copies that state back into the VM kernel driver and appropriately patches it.
In addition to implementing a VM-specific plugin, one must modify about 40 lines in Lguest (in lguest_user.c). This is because KVM provides an API for communication between the VM kernel driver and the DMTCP plugin, while Lguest does not. Hence, we have augmented the API of Lguest. This is needed to enable the plugin to save and restore state.
In overview, the DMTCP plugin is needed only in the case of VM kernel drivers (KVM/QEMU and Lguest in this work). It does the following:
Time of Original VM Launch: Wrapper functions in the DMTCP plugin record pertinent information from the system calls made by the VM launcher. This information is used to later restore the configuration of memory, etc., of the new virtual machine created by the VM launcher.
Checkpoint Time: The DMTCP plugin is notified of the pre-checkpoint event after the user threads have been quiesced, and before all of user-space memory is copied to a checkpoint image. The DMTCP plugin then copies pertinent information from the data structures inside the VM kernel driver. This uses a kernel driver API to user space (KVM/QEMU), or else an augmented driver provided by us (Lguest).
Restart Time (restoring user-space memory of the VM): DMTCP restores user-space memory to the same addresses where they existed prior to checkpoint. DMTCP does this transparently, and the DMTCP plugin does not do any work at this stage.
Restart Time (re-launching the VM): The DMTCP plugin is notified when all user-space memory and process state has been restored, but before control is returned to the user threads. At this time, the user-space component of the VM has been fully restored. But the pre-checkpoint VM does not exist, and so the VM kernel driver is not aware of any VMs. The DMTCP plugin replays a modified version of the first few system calls by the VM launcher. It replays just enough to provide an “empty shell” of a virtual machine. Many of the VM kernel driver data structures have not been initialized, and for some data structures, not even storage has been allocated.
Restart Time (patching the VM kernel driver): The DMTCP plugin must now copy its saved VM kernel driver state back into the VM kernel driver. However, in some cases, that VM kernel driver state must be modified to account for the fact that this is not the original VM, and the kernel may have changed some of the memory addresses in this re-launched VM. This uses a kernel driver API to user space (KVM/QEMU), or else an augmented driver provided by us (Lguest).
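The sequence of steps above can be sketched with a toy model in Python. The classes ToyDriver and ToyPlugin are hypothetical stand-ins invented for this sketch. They do not correspond to the KVM or Lguest drivers or to the DMTCP plugin API, and the real plugin must also patch addresses that change across restart, which this sketch omits.

```python
class ToyDriver:
    """Stand-in for a VM kernel driver; its state does not survive a restart."""

    def __init__(self):
        self.vm = None

    def create_vm(self, guest_mem_start):
        self.vm = {"guest_mem_start": guest_mem_start, "regs": {"pc": 0}}

    def get_regs(self):
        return dict(self.vm["regs"])

    def set_regs(self, regs):
        self.vm["regs"] = dict(regs)


class ToyPlugin:
    """Lives in user space, so its fields are saved with the checkpoint image."""

    def __init__(self, driver):
        self.driver = driver
        self.launch_arg = None   # recorded by the wrapper at launch time
        self.saved_regs = None   # copied out of the driver at pre-checkpoint

    def create_vm(self, guest_mem_start):
        """Wrapper: record the launcher's argument, then call the real driver."""
        self.launch_arg = guest_mem_start
        self.driver.create_vm(guest_mem_start)

    def pre_checkpoint(self):
        self.saved_regs = self.driver.get_regs()

    def post_restart(self, new_driver):
        self.driver = new_driver
        self.driver.create_vm(self.launch_arg)   # re-launch an empty shell
        self.driver.set_regs(self.saved_regs)    # copy the saved state back


if __name__ == "__main__":
    driver = ToyDriver()
    plugin = ToyPlugin(driver)
    plugin.create_vm(0x1000)
    driver.vm["regs"]["pc"] = 42     # the guest runs for a while
    plugin.pre_checkpoint()
    fresh = ToyDriver()              # after restart the driver knows no VM
    plugin.post_restart(fresh)
    print(fresh.vm)
```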
3.2 Launching a Virtual Machine
We first describe in general terms how a virtual machine is launched (created). Any particular virtual machine may differ in some detail, or may merge or sub-divide the steps described. It is partly for this reason that we do not currently see the possibility of a fully transparent checkpoint-restart mechanism. However, this general description provides a framework that can be used to accelerate the development of a plugin for a new virtual machine.
Figure 2 shows those portions of a virtual machine of interest for checkpoint-restart. Typically, the virtual machine is created by a command issued from user space. The program run by that command is referred to as a VM launcher, which sets up, runs and services the guest.
The launcher must:
open an interface to the host kernel via a character device (e.g. /dev/kvm or /dev/lguest).
initialize the VM: tell the kernel where the guest physical memory starts in the launcher’s virtual address space.
arrange to virtualize IRQ interrupts.
create and initialize virtual CPUs to hold the current state of the registers.
run the guest.
3.3 Re-Starting a Virtual Machine from a Checkpoint Image
In restoring the memory after checkpoint, the user-space memory (memory of QEMU) is restored exactly. The memory within the VM kernel driver must be restored in three steps.
Launch a “shell” of a new VM in kernel driver: On restart, the DMTCP plugin executes the first few steps of launching a VM, in order to create an empty shell for the VM data structures. (See Figure 3.) We refer to this as “re-launch”.
Restore pre-checkpoint state of kernel driver: Next, we identify those data structures of the VM kernel driver that have not yet been initialized. For those data structures, we design the DMTCP plugin to save the values at the time of checkpoint, and restore the values at the time of restart.
Patch kernel driver state: Finally, some of the data values that are restored are incorrect. These must be filled in correctly on a case-by-case basis. For example, at the time of launching a new VM, KVM dynamically allocates memory for a struct that describes the memory addresses where user-space QEMU resides. At the time of restart, the kernel is unlikely to allocate the new struct at the same address as before. The DMTCP plugin must save the data from the old struct prior to checkpoint and use it within the new struct allocated at the time of restart. KVM provides an API for this purpose, while for Lguest the API was augmented.
3.4 Case History: APIs used by DMTCP Plugin for KVM
QEMU uses KVM’s ioctl commands to check for the different hardware capabilities and to configure data structures internal to the kernel driver. These data structures represent the state of the virtual machine. Once configured, they can be read from QEMU using ioctl system calls with different GET_XXX parameters. The DMTCP plugin retrieves values of relevant data structures for the task state segment address, the guest registers, a programmable interval timer, the IRQ chip and the registers of the virtual CPU.
For certain internal kernel driver data structures, there is a SET_XXX parameter, but no corresponding GET_XXX parameter. Hence, the DMTCP plugin defines a wrapper function around ioctl, and monitors the initialization of the missing data structures via calls by the VM launcher of ioctl. Upon restart, the plugin (running inside the launcher process) issues an ioctl call with the appropriate SET_XXX parameter and the appropriate values discovered during the original launch.
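The following Python sketch illustrates how state with a getter and state without one can be handled together. It uses a toy driver rather than /dev/kvm. The request names and the choice of which request lacks a getter are assumptions made for the sketch, as are the class names ToyKvm and IoctlRecorder.

```python
class ToyKvm:
    """Toy driver with ioctl-style requests (all request names are hypothetical)."""

    def __init__(self):
        self.state = {}

    def ioctl(self, request, arg=None):
        kind, _, name = request.partition("_")
        if kind == "SET":
            self.state[name] = arg
            return 0
        if kind == "GET" and name != "TSS_ADDR":   # TSS_ADDR is write-only here
            return self.state[name]
        raise OSError("unsupported request: " + request)


class IoctlRecorder:
    """Wrapper around ioctl that remembers arguments of set-only requests."""

    def __init__(self, driver, set_only):
        self.driver = driver
        self.set_only = set(set_only)
        self.recorded = {}

    def ioctl(self, request, arg=None):
        if request in self.set_only:
            self.recorded[request] = arg           # no getter exists, so keep a copy
        return self.driver.ioctl(request, arg)

    def save(self, get_requests):
        """At pre-checkpoint: read every gettable structure from the driver."""
        return {req: self.driver.ioctl(req) for req in get_requests}

    def restore(self, new_driver, saved):
        """At post-restart: write back gettable state, then the recorded state."""
        for get_req, value in saved.items():
            new_driver.ioctl("SET" + get_req[3:], value)
        for set_req, arg in self.recorded.items():
            new_driver.ioctl(set_req, arg)


if __name__ == "__main__":
    launcher_view = IoctlRecorder(ToyKvm(), set_only=["SET_TSS_ADDR"])
    launcher_view.ioctl("SET_TSS_ADDR", 0xFFFBD000)
    launcher_view.ioctl("SET_REGS", {"rip": 7})
    saved = launcher_view.save(["GET_REGS"])
    fresh = ToyKvm()
    launcher_view.restore(fresh, saved)
    print(fresh.state)
```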
3.5 Case History: APIs used by DMTCP Plugin for Lguest
Lguest already employs the read and write system calls to pass the parameters needed in VM launch as described in Section 3.2. These system calls were extended to provide an API for reading and writing internal data structures of the kernel driver. An alternative would have been to augment the API using ioctl, similarly to the situation under KVM. The data structures saved and restored include the virtual CPU, registered eventfd objects, and the addresses of the guest physical memory, stack, page directory, etc.
4 Implementation Challenges
Two particular implementation issues required special treatment.
Most modern guest operating systems provide a graphical user interface, so the graphics state must also be checkpointed and restored. A standard trick is used. The virtual machine is run inside TightVNC, a client-server implementation of VNC (virtual network computing). In particular, QEMU starts up a VNC server for the graphics at the time of launching, and a VNC viewer connects to that server. Just prior to checkpoint, we disconnect the VNC viewer, and we reconnect after resuming or restarting the guest VM.
Anonymous inodes for KVM
When KVM launches a new virtual machine, it maps a region from kernel space to user-space memory for convenience of communication between the kernel-level KVM driver and the user-space QEMU component. This is implemented by having QEMU call mmap on an anonymous inode. (An anonymous inode is deleted when no object continues to refer to it.) Since the anonymous inode is associated with the KVM node, when QEMU calls mmap, the call is intercepted by the KVM driver, which arranges for the kernel memory to be mapped into user space.
At the time of restart, the DMTCP plugin must then re-create the memory region for sharing between user space and kernel space. Since the user-space component (QEMU) will be restored exactly at restart time, it will retain pointers into the address of the mapped region backed by the anonymous inode, as it existed prior to checkpoint.
The DMTCP plugin handles this issue in two phases: prior to checkpoint, and at the time of restart. Prior to checkpoint (during the original launch), the wrapper for mmap inside the DMTCP plugin detects the call by QEMU for the specific anonymous inode in question. The return value of mmap is then saved by the DMTCP plugin. At the time of restart, the DMTCP plugin calls mmap, and specifies the anonymous inode and the original address where it had been mapped. In addition, the DMTCP plugin mmap wrapper passes the MAP_FIXED flag in order to re-map the region at the desired address.
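The effect of MAP_FIXED can be demonstrated from Python by calling the C library's mmap through ctypes. This is a sketch under several assumptions: a Linux host, the Linux value 0x10 for MAP_FIXED (Python's mmap module does not export it), and ordinary temporary files standing in for the anonymous inode. The helper names raw_mmap and map_file are hypothetical. To stay safe, the demonstration re-maps over a region that the process still owns rather than over an arbitrary address.

```python
import ctypes
import mmap
import os
import tempfile

MAP_FIXED = 0x10  # assumption: Linux value; not exported by Python's mmap module

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

MAP_FAILED = ctypes.c_void_p(-1).value


def raw_mmap(addr, length, flags, fd):
    """Call mmap(2) directly and return the mapped address as an integer."""
    prot = mmap.PROT_READ | mmap.PROT_WRITE
    result = libc.mmap(addr, length, prot, flags, fd, 0)
    if result == MAP_FAILED:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def map_file(path, length, addr=None):
    """Map `path` shared; if `addr` is given, force that address with MAP_FIXED."""
    flags = mmap.MAP_SHARED | (MAP_FIXED if addr is not None else 0)
    fd = os.open(path, os.O_RDWR)
    try:
        return raw_mmap(addr, length, flags, fd)
    finally:
        os.close(fd)   # the mapping stays valid after the descriptor is closed


if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    old, new = os.path.join(tmp, "old"), os.path.join(tmp, "new")
    for path, fill in ((old, b"A"), (new, b"B")):
        with open(path, "wb") as f:
            f.write(fill * mmap.PAGESIZE)
    saved_addr = map_file(old, mmap.PAGESIZE)               # original launch
    again = map_file(new, mmap.PAGESIZE, addr=saved_addr)   # re-map at restart
    print(again == saved_addr, ctypes.string_at(saved_addr, 4))
    libc.munmap(saved_addr, mmap.PAGESIZE)
```

Because the restored user-space code still holds pointers into the old region, mapping the new region anywhere else would leave those pointers dangling. Forcing the original address keeps them valid.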
Table 1: Checkpoint-restart in the default mode of DMTCP (gzip compression) on an idle virtual machine.
| Mem. (MB) | Mem. (MB) | Ckpt (s) | Restart (s) | Image | Ckpt (s) | Restart (s) | Image | Ckpt (s) | Restart (s) | Image |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 2.5 | 2.292 | 1.264 | 30 MB | 3.949 | 1.308 | 44 MB | 4.342 | 1.686 | 59 MB |
| 256 | 4.2 | 3.169 | 1.382 | 33 MB | 6.424 | 2.353 | 89 MB | 7.705 | 3.017 | 109 MB |
| 512 | 184 | 5.390 | 2.417 | 35 MB | 9.886 | 3.278 | 129 MB | 11.870 | 4.427 | 170 MB |
| 768 | 441 | 6.823 | 3.013 | 38 MB | 9.212 | 3.307 | 130 MB | 14.039 | 5.047 | 194 MB |
| 1024 | 700 | 8.339 | 2.986 | 37 MB | 10.033 | 3.130 | 122 MB | 16.504 | 5.467 | 208 MB |
Table 2: Forked checkpointing on an idle virtual machine.
| Memory (MB) | Ckpt (s) | Restart (s) | Image Size | Ckpt (s) | Restart (s) | Image Size | Ckpt (s) | Restart (s) | Image Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 0.157 | 1.183 | 30 MB | 0.183 | 1.284 | 44 MB | 0.161 | 1.697 | 59 MB |
| 256 | 0.174 | 1.426 | 32 MB | 0.200 | 2.379 | 90 MB | 0.165 | 2.985 | 111 MB |
| 512 | 0.176 | 2.523 | 35 MB | 0.233 | 3.061 | 122 MB | 0.174 | 4.435 | 171 MB |
| 768 | 0.174 | 2.447 | 36 MB | 0.211 | 3.106 | 122 MB | 0.183 | 4.970 | 191 MB |
| 1024 | 0.178 | 2.818 | 37 MB | 0.243 | 2.964 | 116 MB | 0.191 | 5.633 | 213 MB |
5 Experimental Results
We ran our experiments on a system with an Intel Core i7 (2.3 GHz) and 8 GB of RAM. This was part of a MacBook laptop with a 256 GB SSD. The host operating system was a 32-bit version of Ubuntu-12.10 with Linux kernel-3.5.7. The host was running natively in its own partition on the MacBook. The guest was set up to run Ubuntu-8.04. DMTCP svn revision 1755 was used for all experiments.
All experiments represent full snapshots, including a snapshot of the guest filesystem. The guest filesystem appears as a single file within the host filesystem. Unless otherwise noted, the guest filesystem is located within a Btrfs filesystem of the host operating system. Checkpoint includes the time to create a snapshot of the guest filesystem within Btrfs. The snapshot of the guest filesystem is created using the GNU coreutils command “cp --reflink”. This operation tends to be fast, since it primarily involves taking a snapshot of the current data blocks of the host file comprising the guest filesystem.
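A snapshot step of this kind could be scripted as in the Python sketch below. The function name snapshot_file is hypothetical. Unlike the plain “cp --reflink” used in the experiments, the sketch passes --reflink=auto so that cp falls back to an ordinary copy on filesystems without copy-on-write support, and it falls back to shutil.copyfile if the cp command is unavailable or rejects the option.

```python
import os
import shutil
import subprocess
import tempfile


def snapshot_file(src, dst):
    """Copy `src` to `dst`, sharing data blocks when the filesystem allows it.

    Returns "cp" if the cp command succeeded, or "fallback" if a plain
    Python copy was used instead.
    """
    try:
        subprocess.run(["cp", "--reflink=auto", src, dst], check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return "cp"
    except (OSError, subprocess.CalledProcessError):
        shutil.copyfile(src, dst)
        return "fallback"


if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    disk = os.path.join(tmp, "guest.img")
    with open(disk, "wb") as f:
        f.write(b"guest filesystem contents")
    print(snapshot_file(disk, os.path.join(tmp, "guest.img.snap")))
```

On Btrfs the reflinked copy shares its data blocks with the original until either file is modified, which is why the snapshot cost does not grow with the size of the guest filesystem.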
Experiments were conducted for: broad coverage (Section 5.2); forked checkpointing (Section 5.3); tests of DMTCP features for forked checkpointing and mmap-based fast restart (Section 5.4); analyzing the impact of running the nbench2 benchmark program (Section 5.5); and the overhead of saving snapshots of the guest filesystem on a host Btrfs filesystem (Section 5.6).
5.2 Coverage tests
Table 1 demonstrates the memory-intensive version of checkpoint-restart using the default mode of DMTCP (using gzip compression) on an idle virtual machine. The checkpoint times grow roughly proportionally to the size of the allocated memory for the larger sizes (512 MB guest VM to 1024 MB guest VM). Below those memory sizes, other factors in the checkpoint times presumably dominate. Restart times do not change appreciably at the higher ranges of memory.
5.3 Forked checkpointing
Forked checkpointing on an idle virtual machine is demonstrated in Table 2. This uses the “--enable-forked-checkpointing” configure option of DMTCP, such that at checkpoint time, a child process is created. The child fulfills the rest of the checkpoint, while the parent process continues computing concurrently. As would be expected, the parent completes its portion of the checkpoint largely independently of the size of the checkpoint image or allocated memory. Forked checkpointing typically requires about 0.2 seconds. Since the checkpoint was taken while the virtual machine was running, it was not possible to take checkpoints at the same time within the two runs (forked checkpointing and standard). For this reason, the sizes of the images differ by approximately 2.5%, as seen in Table 1.
The times for checkpoint and restart for KVM/QEMU are larger than the times for user-space QEMU. This is because the plugin for KVM/QEMU makes extra system calls at checkpoint and restart time. The times can be reduced by modifying the kernel driver to implement a new system call that coalesces all of the operations of the previous system calls.
5.4 Fast Restart
Table 3: Fast restart on an idle virtual machine (uncompressed checkpoint images).
| Memory (MB) | Ckpt (s) | Restart (s) | Image Size | Ckpt (s) | Restart (s) | Image Size | Ckpt (s) | Restart (s) | Image Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 0.523 | 0.096 | 139 MB | 0.689 | 0.097 | 182 MB | 0.593 | 0.098 | 230 MB |
| 256 | 0.834 | 0.098 | 267 MB | 1.098 | 0.092 | 311 MB | 1.329 | 0.096 | 408 MB |
| 512 | 1.489 | 0.097 | 523 MB | 1.843 | 0.098 | 566 MB | 2.437 | 0.097 | 761 MB |
| 768 | 2.495 | 0.097 | 779 MB | 2.523 | 0.094 | 823 MB | 3.539 | 0.096 | 1.1 GB |
| 1024 | 3.021 | 0.098 | 1.1 GB | 3.119 | 0.098 | 1.1 GB | 4.480 | 0.097 | 1.5 GB |
[Table 4: column headings (MB), Checkpoint (s), Restart (s), Image Size]
Table 3 employs fast restart on an idle virtual machine using the “--enable-fast-ckpt-restart” option of DMTCP. This option uses mmap to map the checkpoint image directly into memory, instead of copying it. In this case, memory is demand-paged in as needed from the checkpoint image. In this mode, compression is not used in creating the checkpoint image. Checkpoint times are somewhat faster in writing an uncompressed checkpoint image to disk, since the time for executing gzip (compression) dominates over the time to write to disk.
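The principle of mmap-based restart can be shown with Python's standard mmap module. This is a sketch of the general technique, not of DMTCP's restart code, and the function names write_image and fast_restore are hypothetical.

```python
import mmap
import os
import tempfile


def write_image(path, memory):
    """Write the raw (uncompressed) bytes of `memory` as the checkpoint image."""
    with open(path, "wb") as f:
        f.write(memory)


def fast_restore(path):
    """Map the image instead of reading it; pages are faulted in on first touch.

    ACCESS_COPY gives a private copy-on-write mapping, so writes made after
    the restart do not modify the image file.
    """
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)


if __name__ == "__main__":
    image = os.path.join(tempfile.mkdtemp(), "ckpt.img")
    write_image(image, b"\x00" * 8192 + b"guest")
    memory = fast_restore(image)      # returns without reading the whole file
    print(memory[8192:8197])          # touching a page brings it into RAM
    memory.close()
```

The mapping call returns after setting up page tables only, which is consistent with restart times that stay near 0.1 seconds regardless of image size. The image must be uncompressed for this to work, since a compressed file cannot be paged in directly.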
Table 4 presents the results of combining both fast-restart and forked-checkpointing mechanisms on QEMU/KVM. Note that on restart from a checkpoint image, the shadow page tables inside the kernel must be recreated, after which the pages will be faulted back into RAM. The impact of this on the performance of the running applications within the guest operating system is not captured by these tables. The tables indicate only the time after which the virtual machine can begin to execute.
5.5 The nbench2 benchmark program
The numbers in Table 5 demonstrate the small overhead of executing with DMTCP. DMTCP incurs this overhead due to its use of wrapper functions around certain system calls. We used the nbench2 benchmark program [May] to analyze the overhead under conditions of stress. The nbench2 benchmark program is a collection of applications that stress the cpu and the memory. The applications stress the integer unit, the floating-point unit and the memory subsystem. The indexes in Table 5 are a measure of performance, normalized with respect to the AMD K6/233. Higher numbers are better.
Table 5 shows that DMTCP has little impact on performance for a VM running CPU-intensive or memory-intensive loads. In contrast, the performance of KVM/QEMU is much higher than that of user-space QEMU, as expected.
Table 6 shows the large impact of using DMTCP optimizations to enhance the checkpoint and restart times. Further, one can compare the effect of running a virtual machine under load with an idle virtual machine. Table 6 shows a machine under load (running nbench2), while Tables 1, 2 and 3 show an idle machine. The checkpoint and restart times are almost the same in the two cases. The size of the checkpoint image increases by at most 7.2% when under load. This is due to fewer zero pages when under load.
[Table 5: column headings Memory Index, Integer Index, Floating-point Index (repeated for two configurations)]
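Table 6: Checkpoint and restart under load (running nbench2), by checkpoint mechanism.
| Checkpoint Mechanism | KVM/QEMU: Checkpoint (s) | KVM/QEMU: Restart (s) | KVM/QEMU: Image Size | QEMU (user-space): Checkpoint (s) | QEMU (user-space): Restart (s) | QEMU (user-space): Image Size |
| --- | --- | --- | --- | --- | --- | --- |
| Default-ckpt | 9.915 | 3.203 | 125 MB | 15.154 | 5.967 | 226 MB |
| Forked-ckpt | 0.214 | 3.171 | 125 MB | 0.188 | 5.902 | 226 MB |
| Fast-restart | 3.245 | 0.098 | 1.1 GB | 4.382 | 0.093 | 1.5 GB |
| Forked-ckpt/Fast-restart | 0.206 | 0.095 | 1.1 GB | 0.212 | 0.122 | 1.5 GB |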
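To make the size of the improvement concrete, the ratios between the default mode and the combined optimizations can be computed directly from the Table 6 timings. The short Python sketch below does only that arithmetic. The dictionary layout and the function name speedup are choices made for the sketch.

```python
# (checkpoint seconds, restart seconds) copied from Table 6.
TABLE6 = {
    "Default-ckpt": {"KVM/QEMU": (9.915, 3.203),
                     "QEMU (user-space)": (15.154, 5.967)},
    "Forked-ckpt": {"KVM/QEMU": (0.214, 3.171),
                    "QEMU (user-space)": (0.188, 5.902)},
    "Fast-restart": {"KVM/QEMU": (3.245, 0.098),
                     "QEMU (user-space)": (4.382, 0.093)},
    "Forked-ckpt/Fast-restart": {"KVM/QEMU": (0.206, 0.095),
                                 "QEMU (user-space)": (0.212, 0.122)},
}


def speedup(vm, mechanism="Forked-ckpt/Fast-restart", baseline="Default-ckpt"):
    """Return (checkpoint speedup, restart speedup) of `mechanism` over `baseline`."""
    base_ckpt, base_restart = TABLE6[baseline][vm]
    ckpt, restart = TABLE6[mechanism][vm]
    return base_ckpt / ckpt, base_restart / restart


if __name__ == "__main__":
    for vm in ("KVM/QEMU", "QEMU (user-space)"):
        ckpt_x, restart_x = speedup(vm)
        print(f"{vm}: checkpoint {ckpt_x:.1f}x, restart {restart_x:.1f}x")
```

For KVM/QEMU this gives roughly a 48-fold reduction in checkpoint delay and a 34-fold reduction in restart time, at the cost of an uncompressed image (1.1 GB instead of 125 MB).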
[Table 7: column headings Checkpoint (s), Restart (s)]
Table 7 shows the advantage of using the copy-on-write feature of Btrfs to store the guest VM’s filesystem. At checkpoint time a small additional DMTCP plugin rapidly copies the state of the entire filesystem (which appears as a single file on the host filesystem), using the --reflink option of the GNU coreutils cp command. At restart time the state of the guest filesystem is similarly copied back.
6 Related Work
Xen has offered checkpointing at least since [VNOS06]. A faster checkpoint-restart based on COW (copy-on-write filesystems) was developed independently by two groups [SB05, Col12]. Later, support for deduplication in Xen checkpoints was described in [PEL11].
QEMU can be checkpointed by issuing the “stop” command, followed by the “savevm” command. This capability has been enhanced in the case of QEMU running on top of KVM (kernel-based virtual machine). This was done by modifying KVM to add an additional checkpoint thread [SC11] and similarly by modifying the live migration facility of KVM to save a copy while migrating the ongoing VM computation [CLO08].
For the support of snapshots, one requires a copy-on-write filesystem. A common current choice is QCOW2 [McL08], which supports the creation of incremental snapshots. Another recent choice is BlobSeer [NAB11], as used in [NC11, Section 3.3]. That choice has the advantage of exposing the raw checkpoint image file to the host operating system or hypervisor.
The work described here uses Btrfs [RBM12]. Like BlobSeer, Btrfs exposes the raw checkpoint image to the host, making it compatible with the use of DMTCP from outside both the VM and the VM kernel driver. Btrfs is a mainstream filesystem (in the mainline Linux kernel since Linux 2.6.29). Both Fedora and Ubuntu are planning for Btrfs to be the default filesystem in late 2013 or later.
An alternative to using a copy-on-write filesystem for snapshots is the use of a stackable filesystem. This was discussed in [VNOS06], with the idea of using UnionFS [WDG06]. They appear not to have implemented it.
DMTCP [AAC09] was chosen for checkpoint-restart due to its recent support for plugins. This eased the job of checkpoint-restart, since other choices would have required modification to the underlying checkpoint-restart package. In addition, DMTCP’s support for forked checkpointing, and for fast restart (based on mmap), were also helpful in demonstrating those features for virtual machine snapshots.
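The two techniques named above can be sketched in a few lines. This is an illustration of the general idea only and does not reproduce DMTCP's implementation; the function names and the toy "guest state" are hypothetical. In forked checkpointing, a forked child writes the image while the parent resumes at once. In mmap-based fast restart, an uncompressed image is mapped into memory so that pages are read from disk only when first touched.

```python
import mmap
import os
import tempfile


def forked_checkpoint(state: bytearray, path: str) -> int:
    """Fork; the child writes `state` uncompressed to `path`. Returns child pid."""
    pid = os.fork()
    if pid == 0:
        # Child: sees a copy-on-write view of the parent's memory at fork time.
        status = 1
        try:
            with open(path, "wb") as f:
                f.write(state)
            status = 0
        finally:
            os._exit(status)  # exit without running the parent's cleanup code
    return pid  # Parent: continues executing immediately.


def fast_restart(path: str) -> mmap.mmap:
    """Map the image privately; pages are loaded lazily on first access."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)


def demo() -> bytes:
    state = bytearray(b"guest-memory-at-checkpoint")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ckpt.img")
        pid = forked_checkpoint(state, path)
        state[:5] = b"GUEST"  # parent keeps running and mutating its memory
        _, status = os.waitpid(pid, 0)
        if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
            raise RuntimeError("checkpoint child failed")
        restored = fast_restart(path)
        try:
            return bytes(restored)
        finally:
            restored.close()


if __name__ == "__main__":
    print(demo())
```

The image holds the state as of the fork, not the parent's later modifications. Because the mapped image must be uncompressed, this approach trades a larger image for a faster restart, consistent with the image sizes reported for the fast-restart rows.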
Among other choices for checkpoint-restart, BLCR [HD06] has the longest history among the commonly used checkpoint-restart packages. It is based on a kernel module, and has especially strong support for use with MPI-based checkpoint-restart services and with batch queues. CryoPid2 [O’N] represents an alternative user-space checkpoint-restart package based on using ptrace to control the target application. OpenVZ [Ope] is a kernel-based checkpoint-restart package based on Linux containers. CRIU [CRI] is a recent checkpoint-restart package with an interesting hybrid strategy between user-space and kernel-space approaches. The Linux kernel has been extended to include many interfaces that expose the kernel internals. CRIU uses those interfaces to provide an entirely user-space checkpoint-restart package.
A generic checkpoint-restart mechanism was presented based on the DMTCP checkpoint-restart package. DMTCP can directly checkpoint the user-space QEMU virtual machine. In other cases, where the virtual machine employs a kernel driver, DMTCP relies on an API to transfer driver state between the kernel driver and user space. KVM provides such an API, and so a 200-line DMTCP plugin sufficed to implement checkpoint-restart for KVM/QEMU. Lguest does not provide such an API, and about 40 lines were added to augment the Lguest kernel driver. The development time for adding checkpoint-restart to a new virtual machine is estimated at five person-days where a full kernel driver API is provided (as for KVM), and ten person-days where it is not (as for Lguest).
The method is applicable wherever DMTCP is available. DMTCP currently runs under Linux (x86, x86_64, and ARM). Thin hypervisors may or may not support DMTCP, depending on what features of Linux they support.
The generic mechanism presented assumes a homogeneous architecture (same CPU, same host operating system, same hardware). Future work may consider removing some of those restrictions, especially that of homogeneous hardware. Future work will also explore transparently checkpointing a cluster of virtual machines.
Where the kernel driver API must be extended (Lguest, in our case), an alternative approach was considered. In this approach, all of the system calls from the VM launcher to the VM kernel driver are recorded at the time of launch, and those system calls are then played back at the time of restart (although possibly in modified form). This may have advantages in being more robust as the VM software evolves. This also is a topic for future work.
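The record-and-replay idea can be sketched as follows. Since this approach is left for future work, everything here is hypothetical: `FakeDriver` stands in for a VM kernel driver, and the request names `set_memory` and `map_guest` are invented for illustration and do not correspond to any real Lguest or KVM interface.

```python
from typing import Any, Callable, Dict, List, Tuple

Call = Tuple[str, Tuple[Any, ...]]


class FakeDriver:
    """Hypothetical stand-in for a VM kernel driver; it only accumulates state."""

    def __init__(self) -> None:
        self.state: Dict[str, Tuple[Any, ...]] = {}

    def request(self, name: str, *args: Any) -> None:
        self.state[name] = args


class RecordingLauncher:
    """Forwards each request to the driver and logs it for later replay."""

    def __init__(self, driver: FakeDriver) -> None:
        self.driver = driver
        self.log: List[Call] = []

    def request(self, name: str, *args: Any) -> None:
        self.log.append((name, args))
        self.driver.request(name, *args)


def replay(log: List[Call], driver: FakeDriver,
           rewrite: Callable[[Call], Call] = lambda call: call) -> None:
    """Play the recorded requests into a fresh driver, optionally modified."""
    for call in log:
        name, args = rewrite(call)
        driver.request(name, *args)


# Launch: record the requests as they are issued.
launcher = RecordingLauncher(FakeDriver())
launcher.request("set_memory", 64)
launcher.request("map_guest", 0x1000)


def relocate(call: Call) -> Call:
    """Example of a modified replay: the guest is mapped at a new address."""
    name, args = call
    return (name, (0x2000,)) if name == "map_guest" else call


# Restart: replay the log into a new driver instance.
restarted = FakeDriver()
replay(launcher.log, restarted, rewrite=relocate)
print(restarted.state)
```

The `rewrite` hook corresponds to the "possibly in modified form" caveat: values that were valid at launch, such as addresses, may need to be adjusted before being replayed.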
The authors gratefully acknowledge the discussions and insights provided by Zhengping Jing.
- [AAC09] Jason Ansel, Kapil Arya, and Gene Cooperman. DMTCP: Transparent checkpointing for cluster computations and the desktop. In 23rd IEEE International Symposium on Parallel and Distributed Processing (IPDPS-09), pages 1–12, 2009.
- [BDF03] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proc. of 19th ACM symposium on Operating systems principles, SOSP ’03, pages 164–177, New York, NY, USA, 2003. ACM.
- [CLO08] K. Chanchio, C. Leangsuksun, H. Ong, V. Ratanasamoot, and A. Shafi. An efficient virtual machine checkpointing mechanism for hypervisor-based HPC systems. In High Availability and Performance Computing Workshop (HAPCW), 2008.
- [Col12] Patrick Colp. Xen project code released: VM snapshots. http://vmblog.com/archive/2009/04/22/xen-project-code-released-vm-snapshots.aspx, Accessed Nov. 18, 2012.
- [CRI] CRIU team. CRIU. http://criu.org/.
- [DMT12] DMTCP team. DMTCP: Distributed multithreaded checkpointing. http://dmtcp.sourceforge.net, Accessed Nov. 18, 2012.
- [HD06] Paul Hargrove and Jason Duell. Berkeley lab checkpoint/restart (BLCR) for Linux clusters. Journal of Physics Conference Series, 46:494–499, September 2006.
- [KVM12] KVM team. Kvm — Qemu. http://wiki.qemu.org/KVM, see also http://www.linux-kvm.org/page/Main_Page, Accessed Nov. 18, 2012.
- [LNP90] Kai Li, Jeffrey F. Naughton, and James S. Plank. Real-time, concurrent checkpoint for parallel programs. In Proc. of Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 79–88, March 1990.
- [LNP94] Kai Li, Jeffrey F. Naughton, and James S. Plank. Low-latency, concurrent checkpointing for parallel programs. IEEE Transactions on Parallel and Distributed Systems, 5(8):874–879, August 1994.
- [May] Uwe F. Mayer. Linux/Unix nbench. http://www.tux.org/~mayer/linux/bmark.html; retrieved Dec. 4, 2012.
- [McL08] Mark McLoughlin. The QCOW2 image format. http://people.gnome.org/~markmc/qcow-image-format.html, 2008.
- [NAB11] Bogdan Nicolae, Gabriel Antoniu, Luc Bougé, Diana Moise, and Alexandra Carpen-Amarie. BlobSeer: Next generation data management for large scale infrastructures. Journal of Parallel and Distributed Computing, 71(2):168–184, February 2011.
- [NC11] Bogdan Nicolae and Franck Cappello. BlobCR: efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots. In Proc. of 2011 Int. Conf. for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 1–12. ACM, 2011.
- [O’N] Mark O’Neill. Cryopid2. http://sourceforge.net/projects/cryopid2.
- [Ope] OpenVZ team. OpenVZ. http://wiki.openvz.org/.
- [PEL11] Eunbyung Park, Bernhard Egger, and Jaejin Lee. Fast and space-efficient virtual machine checkpointing. In Proc. of 7th ACM SIGPLAN/SIGOPS Int. Conf. on Virtual Execution Environments, VEE ’11, pages 75–86, New York, NY, USA, 2011. ACM.
- [PXN95] J. S. Plank, J. Xu, and R. H. B. Netzer. Compressed differences: An algorithm for fast incremental checkpointing. Technical Report CS-95-302, University of Tennessee, August 1995.
- [Qem12] Qemu team. Qemu. http://wiki.qemu.org/Main_Page, Accessed Nov. 18, 2012.
- [RBM12] Ohad Rodeh, Josef Bacik, and Chris Mason. BTRFS: The Linux B-tree filesystem. Technical report, IBM Research Report, July 2012. RJ10501 (ALM1207-004); http://domino.watson.ibm.com/library/CyberDig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf.
- [Rus12] Rusty Russell. Lguest: The simple x86 hypervisor. http://lguest.ozlabs.org/, Accessed Nov. 18, 2012.
- [SB05] Michael H. Sun and Douglas M. Blough. Fast, lightweight virtual machine checkpointing. Technical report, Georgia Institute of Technology, 2005. GIT-CERCS-10-05; http://www.cercs.gatech.edu/tech-reports/tr2010/git-cercs-10-05.pdf.
- [SC11] Vasinee Siripoonya and Kasidit Chanchio. Thread-based live checkpointing of virtual machines. In 10th IEEE Int. Symp. on Network Computing and Applications, 2011.
- [TL01] Douglas Thain and Miron Livny. Multiple bypass: Interposition agents for distributed computing. Cluster Computing, 4(1):39–47, 2001.
- [VNOS06] Geoffroy Vallée, Thomas Naughton, Hong Ong, and Stephen L. Scott. Checkpoint/restart of virtual machines based on xen. In HAPCW’06: High Availability and Performance Computing Workshop, Santa Fe, New Mexico, USA, October 2006. Held in conjunction with LACSI 2006.
- [WDG06] Charles P. Wright, Jay Dave, Puja Gupta, Harikesavan Krishnan, David P. Quigley, Erez Zadok, and Mohammad Nayyer Zubair. Versatility and Unix semantics in namespace unification. ACM Transactions on Storage (TOS), 2(1):74–105, 2006.