Are We All on the Same Page? Let's Fix That

Thursday, June 15, 2023 - 2:10 pm–3:05 pm

Luis Mineiro, Delivery Hero

Abstract: 

Established good practice in the industry is to have as few alerts as possible, alerting on symptoms associated with end-user pain rather than trying to catch every possible way that pain could be caused.

Organizations with complex distributed systems that span dozens of teams can have a hard time following this practice without burning out the teams that own the client-facing services. A typical solution is to alert on every layer of the distributed system. This approach almost always leads to an excessive number of alerts and results in alert fatigue.

Adaptive Paging is an alert handler that leverages the causality from tracing and the semantic conventions of OpenTracing/OpenTelemetry to page the team closest to the problem. From a single alerting rule, a set of heuristics can be applied to identify the most probable cause, paging the respective team instead of the alert owner.
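
As a rough illustration of the idea, and not the implementation presented in the talk, the sketch below applies one simple heuristic to a single trace: starting at the root span, keep following errored child spans and page the team that owns the deepest one. The `Span` class, its fields, the `team_to_page` function, and the team names are all hypothetical; the `service` and `error` fields merely stand in for the kind of information that tracing semantic conventions provide, and the `team` field stands in for a service ownership lookup.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """Hypothetical, simplified span; not a real OpenTelemetry type."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str              # stand-in for a service name attribute
    error: bool               # stand-in for an error status on the span
    team: str                 # assumed result of a service ownership lookup


def team_to_page(spans: List[Span], alert_owner: str) -> str:
    """Return the team owning the deepest errored span reachable from the root.

    Falls back to the alert owner when the trace has no root span or the
    root span is not marked as errored.
    """
    children: Dict[str, List[Span]] = {}
    root: Optional[Span] = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)

    if root is None or not root.error:
        return alert_owner

    current = root
    while True:
        errored = [c for c in children.get(current.span_id, []) if c.error]
        if not errored:
            return current.team
        # Assumption: with several errored children, follow the first one.
        current = errored[0]


# Example trace: checkout -> payment -> fraud all fail, cart succeeds.
example_trace = [
    Span("1", None, "checkout", True, "team-checkout"),
    Span("2", "1", "cart", False, "team-cart"),
    Span("3", "1", "payment", True, "team-payment"),
    Span("4", "3", "fraud", True, "team-fraud"),
]
paged_team = team_to_page(example_trace, alert_owner="team-checkout")
```

In this example the alert fires on the checkout operation, but the page would go to the hypothetical `team-fraud`, because the error chain ends at their service. A real handler would presumably combine several heuristics and look at many traces rather than one.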

The approach enables an effective symptom-based alerting strategy, with thresholds derived from the service level objective of the respective operation.

Luis Mineiro, Delivery Hero

Luis's broad background in software engineering includes experience in DevOps, networks, mobile development, and more. He is passionate about reliability engineering, with an obsession with on-call health and getting rid of false positives. Luis has been with Delivery Hero since 2022, creating a developer platform focused on self-service and automation.

BibTeX
@conference {288297,
author = {Luis Mineiro},
title = {Are We All on the Same Page? Let{\textquoteright}s Fix That},
year = {2023},
address = {Singapore},
publisher = {USENIX Association},
month = jun
}
