Tradeoffs in compressing virtual machine checkpoints

Project description

Checkpoint replication is a prevalent approach to maintaining VM availability even when the hosting machine fails. When a machine failure occurs, the affected VMs resume execution from their latest checkpoints on other healthy hosts with sufficient resources. Checkpoint replication is applicable in a wide variety of computing environments, but it protects VMs at the expense of a significant amount of network traffic. For example, if a checkpoint is taken every 25 ms, replication can consume more than 3000 Mb/s of network bandwidth for a single VM. If multiple VMs are protected at the same time, even dedicated GbE links cannot provide the aggregate bandwidth required for checkpoint replication.
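
As a rough back-of-envelope illustration of how dirty-page volume per checkpoint epoch translates into replication bandwidth (the dirty-page count below is an assumption chosen to be consistent with the figure above, not a measurement):

    page_size_bytes = 4 * 1024          # typical x86 page size (4 KiB)
    dirty_pages_per_epoch = 2400        # assumed pages dirtied per epoch (illustrative only)
    epoch_seconds = 0.025               # checkpoint taken every 25 ms

    bytes_per_epoch = dirty_pages_per_epoch * page_size_bytes
    bandwidth_mbps = bytes_per_epoch * 8 / epoch_seconds / 1e6
    print(f"~{bandwidth_mbps:.0f} Mb/s for a single VM")   # ~3146 Mb/s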


Reducing checkpoint replication traffic is key to using this approach for creating highly available VMs. One way to reduce network traffic is to compress checkpoints before sending them over the network. While several compression techniques are available, they have not been compared systematically under different workloads and operating conditions. The primary goal of this project is to conduct a thorough comparative evaluation of compression methods and to provide insights that can guide selection decisions.

We propose similarity compression, a novel method that exploits similarities among VMs to eliminate redundant traffic. We evaluate it along with gzip and delta compression, two popular methods used to reduce checkpoint traffic in prior work. We characterize the methods by their effectiveness in reducing traffic as well as by the overheads they incur. Our evaluation is based on a framework consisting of a pair of checkpoint sender and receiver programs that emulate the operation of a real high-availability (HA) system. We evaluate the compression methods using checkpoints obtained from four different workloads, chosen from categories frequently seen in HA systems: server workloads that constantly interact with external clients, and long-running computation jobs.
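
As a simplified sketch of how the sender side of such a framework might apply per-checkpoint compression before transmission (gzip-style compression shown; the function names and length-prefixed wire format are illustrative assumptions, not the actual interfaces of our emulation programs):

    import gzip
    import struct

    def send_checkpoint(sock, dirty_pages):
        """Compress one checkpoint (a list of dirty-page byte strings) and send it
        as a single length-prefixed frame."""
        payload = gzip.compress(b"".join(dirty_pages), compresslevel=6)
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    def recv_checkpoint(sock):
        """Receive one frame and return the decompressed checkpoint bytes."""
        (length,) = struct.unpack("!I", _recv_exact(sock, 4))
        return gzip.decompress(_recv_exact(sock, length))

    def _recv_exact(sock, n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed connection")
            buf += chunk
        return buf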

Our results show that the compression methods we considered are effective for different types of workloads and ranges of checkpointing frequency, and that they have distinct resource requirements. gzip substantially reduces checkpoint traffic, but incurs prohibitively high CPU overhead. It also has the longest checkpoint transfer time, and thus is not preferred for interactive, latency-sensitive applications. Delta compression has a high memory penalty, and its effectiveness depends on a sufficiently large cache of transmitted dirty pages. We observed that the required cache size varies widely with workloads and checkpointing intervals, and hence must be tuned for effective use.
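
For illustration, a minimal sketch of the delta-compression idea, assuming the sender keeps a cache of the most recently transmitted contents of each dirty page (the encoding and the size threshold below are simplified assumptions, not the exact scheme we evaluated):

    def delta_encode(page_addr, page, cache):
        """Encode one dirty page against the copy of it last sent (if still cached).
        Returns ('full', page) or ('delta', [(offset, new_byte), ...])."""
        old = cache.get(page_addr)
        cache[page_addr] = page                     # remember what the receiver now holds
        if old is None:
            return "full", page                     # cache miss: send the whole page
        delta = [(i, b) for i, (a, b) in enumerate(zip(old, page)) if a != b]
        # Assume each (offset, byte) pair costs ~3 bytes on the wire; a delta only
        # pays off if it is clearly smaller than the raw page.
        return ("delta", delta) if 3 * len(delta) < len(page) else ("full", page)

Because a delta can only be computed against a page that is still cached, the savings degrade once the set of dirty pages outgrows the cache, which is why the cache size must be tuned per workload and checkpointing interval.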

Similarity compression reduces checkpoint traffic by removing redundant content within each VM checkpoint, as well as between checkpoints obtained from multiple VMs on a host. Our evaluation results show that similarity compression uses both CPU and memory efficiently, and incurs short transfer times in most cases. It is most effective for workloads involving collaboration between components and sharing of common code and data, especially when checkpointed at longer intervals. However, it achieves more modest traffic reductions than the other two methods for other workload types.
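
A minimal sketch of the deduplication idea underlying similarity compression, assuming exact-match content hashing across all checkpoints produced on a host (names and data layout are hypothetical; the actual method is more elaborate):

    import hashlib

    def dedup_pages(pages, seen):
        """Replace pages whose content has already been transmitted, by this VM or by
        another VM on the same host, with a short reference to the earlier copy.
        'pages' is an iterable of (addr, bytes); 'seen' maps content digest -> first addr."""
        out = []
        for addr, page in pages:
            digest = hashlib.sha1(page).digest()
            if digest in seen:
                out.append(("ref", addr, digest))   # redundant content: send a reference only
            else:
                seen[digest] = addr
                out.append(("page", addr, page))    # first occurrence: send the content
        return out

VMs that share common code and data produce many identical pages across their checkpoints, which is why this approach is most effective for such workloads.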


People

  • Karen Kai-Yuan Hou, University of Michigan
  • Prof. Kang G. Shin, University of Michigan
  • Yoshio Turner, HP Labs
  • Sharad Singhal, HP Labs


Project sponsors

  • HP Labs


Publications

  • Kai-Yuan Hou, Kang G. Shin, Yoshio Turner, and Sharad Singhal. Tradeoffs in Compressing Virtual Machine Checkpoints. The 7th International Workshop on Virtualization Technologies in Distributed Computing (VTDC'13), June 18, 2013, New York, NY, USA.