Towards a Fault-Resilient Cloud Management Stack

Project description

Cloud-management stacks have become an increasingly important element in cloud computing, serving as the resource manager of cloud platforms. While the functionality of this emerging layer has been constantly expanding, its fault resilience remains under-studied. This project presents a systematic study of the fault resilience of OpenStack, a popular open-source cloud-management stack. We have built a prototype fault-injection framework that targets service communications during the processing of external requests, both among OpenStack services and between OpenStack and external services, and have thus far uncovered 23 bugs in two versions of OpenStack. Our findings shed light on defects in the design and implementation of state-of-the-art cloud-management stacks from a fault-resilience perspective.
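
As a rough illustration of what injecting faults into service communications can look like, the following minimal Python sketch routes each inter-service call through a wrapper that can simulate a fault on one chosen communication edge. It is not the project's framework: the names FaultPlan, FaultInjector, and ServiceCrash, the "sender->receiver" edge labels, and the two fault kinds ("crash" and "drop") are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


class ServiceCrash(Exception):
    """Hypothetical signal that the receiving service crashed."""


@dataclass
class FaultPlan:
    # Both fields are illustrative assumptions, not taken from the project.
    target: str  # edge to disturb, written as "sender->receiver"
    kind: str    # "crash" (receiver fails) or "drop" (message is lost)


class FaultInjector:
    """Wraps service-to-service calls and injects at most one planned fault."""

    def __init__(self, plan: Optional[FaultPlan] = None) -> None:
        self.plan = plan
        self.log: List[str] = []  # every edge exercised, in call order

    def call(self, sender: str, receiver: str,
             send: Callable[[Any], Any], message: Any) -> Any:
        edge = f"{sender}->{receiver}"
        self.log.append(edge)
        if self.plan is not None and self.plan.target == edge:
            if self.plan.kind == "crash":
                # The receiver never handles the message.
                raise ServiceCrash(edge)
            if self.plan.kind == "drop":
                # The message is silently lost; the caller gets no reply.
                return None
        # No fault planned for this edge: deliver the message normally.
        return send(message)
```

A fault-free run with FaultInjector() records in its log the edges that a request exercises; each recorded edge can then be replayed with a FaultPlan to observe how the caller copes with that single fault.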

People

  • Xiaoen Ju, University of Michigan
  • Prof. Kang G. Shin, University of Michigan
  • Livio Soares, IBM T.J. Watson Research Center
  • Kyung Dong Ryu, IBM T.J. Watson Research Center
  • Dilma Da Silva, Qualcomm Research Silicon Valley

Publications

  • Xiaoen Ju, Livio Soares, Kang G. Shin, and Kyung Dong Ryu, Towards a Fault-Resilient Cloud Management Stack, in HotCloud’13
  • Xiaoen Ju, Livio Soares, Kang G. Shin, Kyung Dong Ryu, and Dilma Da Silva, On Fault Resilience of OpenStack, in SoCC’13