It Is OK to Be Metastable

Wednesday, March 20, 2024 - 9:50 am–10:35 am

Aleksey Charapko, University of New Hampshire

Abstract: 

Metastable failures are self-perpetuating performance failures characterized by the positive feedback loop that keeps systems in a degraded state. For a system to enter a metastable failure state, it first needs to be in a metastable vulnerable state in which some event triggers an overload condition and starts the feedback mechanism. A naïve way to dodge metastable failures is to avoid operating in the metastable vulnerable state, precluding the "trigger, overload, feedback loop" failure sequence. Unfortunately, avoiding the metastable vulnerable state is a moot solution; in some cases, this is simply impossible, and in others, it leads to running systems with a high degree of overprovisioning, resulting in poor resource utilization and high cost.

In this talk, I will discuss why it is OK to be in a metastable vulnerable state and what strategies we can use to mitigate the risk of developing a metastable failure. I will present three cornerstones of metastable failure risk mitigation for large systems. The first one is understanding the environments, algorithms, and workloads. The second and third cornerstones — metastable failure trigger-resistant design and protection of vulnerable components — build on the insight developed in the understanding phase.
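The "trigger, overload, feedback loop" sequence described in the abstract can be illustrated with a toy simulation. This is a minimal sketch and not material from the talk: the function `simulate`, its parameters, and all numbers are hypothetical. It assumes a server that delivers a reduced, fixed goodput once offered load exceeds its capacity, and clients that retry every failed request on the next tick.

```python
def simulate(base_load, capacity=100, degraded_goodput=50,
             spike=100, spike_tick=5, ticks=60):
    """Return per-tick goodput for a toy server with client retries.

    All parameters are illustrative assumptions, not measurements.
    """
    retries = 0
    goodput_history = []
    for tick in range(ticks):
        offered = base_load + retries
        if tick == spike_tick:
            offered += spike  # one-off trigger, gone on the next tick
        if offered <= capacity:
            goodput = offered  # healthy: every request is served
        else:
            goodput = degraded_goodput  # overloaded: wasted work cuts goodput
        retries = offered - goodput  # failed requests return next tick
        goodput_history.append(goodput)
    return goodput_history


# Overprovisioned system: the spike causes a brief dip, then recovery.
print(simulate(base_load=40)[-1])

# Well-utilized (vulnerable) system: the same spike leaves it degraded
# long after the trigger is gone, because retries sustain the overload.
print(simulate(base_load=80)[-1])
```

In this model, a base load below the degraded goodput drains its retry backlog and recovers, while a higher base load that is still under capacity runs fine until the trigger and then stays overloaded. Avoiding the vulnerable state here means running at less than half of capacity, which is the overprovisioning cost the abstract points to.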

Aleksey Charapko is an assistant professor at the University of New Hampshire. He works broadly at the intersection of performance, reliability, and efficiency of distributed systems. Before settling into an academic career, Aleksey had a nearly decade-long engineering career, ranging from freelance and consulting work to working in big tech.

BibTeX
@conference {295065,
author = {Aleksey Charapko},
title = {It Is {OK} to Be Metastable},
year = {2024},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = mar
}
