Autopsy of a Cascading Outage from a MySQL Crashing Bug

Tuesday, March 19, 2024 - 11:50 am–12:35 pm

Jean-François Gagné, Aiven, and Swetha Narayanaswamy, HubSpot

Abstract: 

Once upon a time, an application query triggered a crashing bug. After automated failure recovery, the application resent the query, and MySQL crashed again. This was the beginning of a cascading failure that led to full datastore unavailability and some partial data loss.

MySQL is stable enough that it is easy to forget to implement operational best practices such as cascading failure prevention and testing of unlikely recovery scenarios. It happened to us, and this talk is about how we recovered and what we learned from the situation.

Come to this talk for a full post-mortem of a cascading outage caused by a crashing bug. The talk will not only share the operational details of the incident, but will also cover what we could have done differently to reduce its impact (including avoiding data loss), and what we changed in our infrastructure to prevent this from happening again (including cascading failure prevention).

Jean-François Gagné, Aiven

Jean-François is a System / Infrastructure Engineer currently working as a MySQL Open Source Developer in Aiven’s Open Source Program Office (OSPO). Before that, his missions were improving operations and scaling MySQL and MariaDB infrastructures at HubSpot, MessageBird and Booking.com. J-F is also the maintainer of Planet for the MySQL Community: a news aggregator for the MySQL Ecosystem. Before being involved with MySQL, he worked as a System / Network / Storage Administrator in a Linux and VMware environment, as an Architect for a Mobile Telco Service Provider, and as a C & Java Programmer in an IT Service Company. Even before that, while learning computer science, Jeff studied Cache and Memory Consistency in Distributed Systems and Network Group Communication Protocols (yes, the same as Group Replication).

Swetha Narayanaswamy, HubSpot

Swetha Narayanaswamy is Director, Engineering, leading the Data Infrastructure team at HubSpot. The HubSpot application platform is made up of over 15,000 components that are deployed 3,000+ times a day. Its systems make hundreds of billions of requests per day to HBase, Kafka, Elasticsearch and MySQL. Prior to HubSpot, Swetha held leadership roles at a variety of infrastructure companies, including NetApp, Microsoft and EMC, bringing innovative services to market in a high-growth environment. In addition, Swetha is a non-profit board member, diversity advocate and patent holder.

BibTeX
@conference {295051,
author = {Jean-Fran{\c c}ois Gagn{\'e} and Swetha Narayanaswamy},
title = {Autopsy of a Cascading Outage from a {MySQL} Crashing Bug},
year = {2024},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = mar
}