Gray Failure: The Achilles’ Heel of Cloud-Scale Systems

Wednesday, March 20, 2024 - 11:55 am–12:40 pm

Ze Li, Microsoft Azure, and Ryan Huang, University of Michigan

Abstract: 

Cloud scale provides the vast resources necessary to repair failed components, but this is useful only if those failures can be detected. For this reason, the major availability breakdowns and performance anomalies we see in cloud environments tend to be caused by subtle underlying faults, i.e., gray failure rather than fail-stop failure. In this talk, we discuss our experiences with gray failure in Microsoft Azure and use several case studies to show its broad scope and consequences. We also argue that a key feature of gray failure is differential observability: the system’s failure detectors may not notice problems even when applications are afflicted by them. We will show how Microsoft Azure applied differential observability in practice and bridged the gap between different components’ perceptions of what constitutes a failure.

Ze Li, Microsoft Azure

Dr. Ze Li is a principal data scientist manager in Microsoft Azure. He currently focuses on using data-driven and AI/LLM technologies to build and operate cloud services efficiently and effectively, including safe deployment in large-scale systems and intelligent anomaly detection and auto-diagnosis through data mining. Previously, he worked as a data scientist/engineer at Capital One and MicroStrategy, where he provided data-driven solutions to improve efficiency in financial services and business intelligence services. He has published more than 40 peer-reviewed papers in the fields of data mining, distributed networks/systems, and mobile computing.

Ryan Huang, University of Michigan

Dr. Ryan Huang is an Associate Professor in the EECS Department at the University of Michigan, Ann Arbor, where he leads the Ordered Systems Lab. He conducts research broadly in computer systems, specializing in the design of principled methods to improve the reliability and performance of large-scale systems. His work has received multiple best paper awards at top conferences. He is a recipient of the NSF CAREER Award and a Meta research award. More information about him can be found at https://web.eecs.umich.edu/~ryanph/

BibTeX
@conference {295069,
author = {Ze Li and Ryan Huang},
title = {Gray Failure: The {Achilles{\textquoteright}} Heel of {Cloud-Scale} Systems},
year = {2024},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = mar
}
