Kube, Where’s My Metrics? The Challenges of Scaling Multi-Cluster Prometheus

Tuesday, March 19, 2024 - 2:40 pm–3:25 pm

Niko Smeds and Iain Lane, Grafana Labs

Abstract: 

Service and systems monitoring is crucial for healthy, happy applications. For small to medium-sized Kubernetes environments, a single Prometheus HA pair is often sufficient. But when you scale your environments, new challenges arise. As Prometheus ingests and stores significantly more time-series metrics, it can struggle to keep up. And if cluster numbers increase, it becomes difficult to locate the metrics you need. These are known problems with a few common solutions.

In this talk, Niko and Iain will discuss the challenges of scaling Prometheus by walking you through the history of monitoring Grafana Labs' internal services. As they scaled to 40+ clusters across five cloud providers, the Grafana Labs teams frequently iterated on their internal monitoring architecture. They'll cover the basics of Prometheus for monitoring and alerting before moving into remote write and monitoring of multiple clusters.

Niko Smeds, Grafana Labs

Niko is a senior software engineer at Grafana Labs, where he helps build and monitor the Kubernetes and cloud platforms. From OpenStack to K8s, he has experience with both private and public cloud infrastructure.

Iain Lane, Grafana Labs

Iain is a senior software engineer at Grafana Labs. A member of the Cloud Platform team, he focuses on maintaining the infrastructure (Kubernetes clusters) that runs Grafana Cloud, and on helping build tools and processes for engineers to deploy their software into this environment with maximum confidence.

BibTeX
@conference {295043,
author = {Niko Smeds and Iain Lane},
title = {Kube, {Where{\textquoteright}s} My Metrics? The Challenges of Scaling {Multi-Cluster} Prometheus},
year = {2024},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = mar
}