Product Reliability for Google Maps

Monday, March 18, 2024 - 11:00 am–11:45 am

Micah Lerner and Joe Abrams, Google

Abstract: 

As our organization has gotten very good at protecting server SLOs with reliability best practices like scaling globally distributed architectures, toil mitigation, and continuous reliability improvements, we noticed that a majority of incidents impacting our end users were not showing up as an SLO miss.

In many cases, these outages were not even observable from the server side - for example, the rollout of a new version of the consumer mobile application (that our services power) to an app store could break one or more critical features due to bugs in client code. This reality has led to a change in the way we approach reliability - we’re shifting our focus from server reliability to product reliability.

We’re not yet finished with the transition, but we’re starting to see very positive results. Our talk shares challenges we've solved so far, lessons we've learned, and our vision for the future.

Micah Lerner, Google

Micah Lerner is a tech lead at Google, focused on consumer Geo products. Previously, Micah helped build the geospatial datasets powering Mapbox and was an early employee at Strava (where he first read Google's book on SRE).

Joe Abrams, Google

Joe leads site reliability engineering for Google Maps products. He and his team are constantly looking for new ways to protect users from potential production issues. As a self-professed outage nerd, he enjoys hearing about interesting failure tales from inside and outside of Google. When he is not poring over last month's postmortem reports, you can find him on a tennis court trying to make his serve more fault-tolerant.

BibTeX
@conference {295019,
author = {Micah Lerner and Joe Abrams},
title = {Product Reliability for Google Maps},
year = {2024},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = mar
}
