The objective of this project is to protect virtual machines (VMs) from failures of their hosting physical machines while utilizing computing resources efficiently.
Existing approaches for VM high availability set up a backup VM instance for a protected (primary) VM. The backup VM is maintained in a different physical node (referred to as the backup host) than the node in which the primary VM is running (referred to as the primary host), and is kept synchronized with the state of the primary VM. When the primary host goes down unexpectedly, for example, because of a power failure, the backup VM takes over execution promptly, minimizing the interruption of VM operation caused by the failure.
However, such approaches have a high resource cost, particularly in memory consumption. With existing approaches, protecting a VM configured with 1 GB of memory requires reserving another 1 GB of memory on a backup host to maintain the backup VM. This backup memory reservation degrades resource efficiency, because the backup VM is just a passive image of the primary VM; it does not operate and does not contribute to overall system performance and throughput.
We propose HydraVM to provide VM high availability at a low resource cost. Instead of keeping the backup VM instances in physical RAM, HydraVM maintains redundant VM images in shared storage, which is commonly deployed in virtualized environments. HydraVM eliminates the backup memory reservation required by conventional VM high availability approaches. Using HydraVM, all physical memory resources can be utilized by active VMs; no memory resources are tied up by passive VM images.
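To make the memory argument concrete, the following is a rough back-of-the-envelope sketch. The function name `reserved_memory_gb` and its parameters are illustrative assumptions, not part of HydraVM; the sketch counts only RAM reserved for VM images and ignores shared storage space and any transient buffers used while checkpointing.

```python
def reserved_memory_gb(vm_memory_gb, backup_in_ram):
    """Total RAM (in GB) reserved across hosts for a set of protected VMs.

    vm_memory_gb:  configured memory size of each protected VM, in GB.
    backup_in_ram: True for a conventional scheme that mirrors each VM in
                   the RAM of a backup host; False when the redundant image
                   lives in shared storage instead (the HydraVM idea).
    """
    primary = sum(vm_memory_gb)
    # A RAM-resident backup needs as much memory as the VM it mirrors.
    backup = primary if backup_in_ram else 0
    return primary + backup


# The 1 GB VM from the text: 2 GB of RAM with a RAM-resident backup,
# 1 GB when the backup image is kept in shared storage.
conventional = reserved_memory_gb([1], backup_in_ram=True)
storage_backed = reserved_memory_gb([1], backup_in_ram=False)
```

Under these simplifying assumptions, a RAM-resident backup doubles the memory reserved per protected VM, which is the overhead that a storage-based image avoids.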
HydraVM keeps track of the state of the primary VM by taking continuous, incremental checkpoints of the primary. A complete, recent checkpoint of the primary VM is stored in the shared storage, and when the primary host fails, the failed VM can be restored on another healthy host from its most recent checkpoint. Our VM checkpointing mechanism is enhanced with a copy-on-write technique, and we propose a slim VM restore that can bring up a failed VM from its recent checkpoint in the shared storage in about 1.5 seconds.
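The general idea of incremental checkpointing with copy-on-write can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not HydraVM's implementation: the class `PrimaryVM`, its methods, and the dictionary standing in for shared storage are all hypothetical, and real systems track dirty pages at the hypervisor level rather than in application code.

```python
from dataclasses import dataclass, field


@dataclass
class PrimaryVM:
    """Toy VM whose memory is a list of pages (hypothetical model)."""
    pages: list
    # Pages modified since the last checkpoint began.
    dirty: set = field(default_factory=set, init=False)
    # Pages belonging to the in-flight checkpoint that are not yet flushed.
    pending: set = field(default_factory=set, init=False)
    # Pre-write copies of pending pages the VM modified during the flush.
    cow: dict = field(default_factory=dict, init=False)

    def write(self, page_no, data):
        # Copy-on-write: if the page is still waiting to be flushed, keep
        # its checkpoint-time content before the VM overwrites it.
        if page_no in self.pending and page_no not in self.cow:
            self.cow[page_no] = self.pages[page_no]
        self.pages[page_no] = data
        self.dirty.add(page_no)

    def begin_checkpoint(self):
        # Only pages dirtied since the previous checkpoint are captured.
        self.pending, self.dirty, self.cow = self.dirty, set(), {}

    def flush_page(self, page_no, storage):
        # Prefer the saved copy so the checkpoint stays consistent.
        storage[page_no] = self.cow.pop(page_no, self.pages[page_no])
        self.pending.discard(page_no)


def restore(storage):
    """Rebuild VM memory from the checkpoint held in (simulated) storage."""
    return [storage[n] for n in sorted(storage)]


vm = PrimaryVM([b"a", b"b", b"c"])
storage = {}                      # stands in for shared storage

vm.dirty = {0, 1, 2}              # first checkpoint is a full one
vm.begin_checkpoint()
for n in sorted(vm.pending):
    vm.flush_page(n, storage)

vm.write(1, b"B")                 # only page 1 changes
vm.begin_checkpoint()
vm.write(1, b"B2")                # VM keeps running during the flush
vm.flush_page(1, storage)         # checkpoint still records b"B"
recovered = restore(storage)
```

In this sketch the VM is never paused while pages are flushed; a write that races with the flush preserves the old page content first, and the newer content is picked up by the next incremental checkpoint.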
- Karen Kai-Yuan Hou, University of Michigan
- Prof. Kang G. Shin, University of Michigan
- Mustafa Uysal, VMware
- Arif Merchant, Google
- Sharad Singhal, HP Labs
- HP Labs
- Kai-Yuan Hou, Mustafa Uysal, Arif Merchant, Kang G. Shin, and Sharad Singhal. HydraVM: Low-cost, transparent high availability for virtual machines. Technical report, HP Labs, 2011.