Automating Disaster Recovery: The Ultimate Reliability Challenge

Wednesday, March 20, 2024 - 11:55 am–12:40 pm

Ricard Bejarano, Cisco Systems Inc.

Abstract: 

Here's how I explain my job to non-techies: if a meteor struck our servers, it's on my team to fix it. But what if it did? Realistically, what would happen if a meteor struck your datacenter?

Here's the story of a vision to fully automate disaster recovery away, how I pushed back on it, claiming it was impossible, and how we still executed on it to great success.

Ours is also a case study on why looking at these wide-surface problems through a sociotechnical lens will set you up for success in places you could never have anticipated.

So if a metaphorical meteor hit our datacenter, we would just press our metaphorical big red button.


Ricard is a Lead Site Reliability Engineer on ThousandEyes' SRE team. His background is mostly in networking, observability, incident management, infrastructure automation, and hunting down the weirdest of bugs. He has led the execution of the team's vision to fully automate disaster recovery away.

BibTeX
@conference {295083,
author = {Ricard Bejarano},
title = {Automating Disaster Recovery: The Ultimate Reliability Challenge},
year = {2024},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = mar
}
