Taming Spiky Log Volumes: Maintaining Real-Time Log Accessibility with Kaldb

Thursday, June 15, 2023 - 9:00 am–9:55 am

Suman Karumuri

Abstract: 

Logs, much like currency, lose value over time. Observability teams face the challenge of keeping recent logs highly available, especially during incidents or deployments. Traditional log search systems struggle to auto-scale cost-effectively and to respond fast enough during an incident. In this session, we will discuss how Slack tackles spiky log volumes in Elasticsearch, from detecting log spikes at various layers of the stack to handling them with rate limiting, quotas, sampling, and backfills.

While these techniques help, they may result in data loss. Therefore, we will delve into the automation of handling log spikes using Kaldb, an open-source log search engine. We will explore trade-offs to minimize data loss, such as prioritizing the ingestion of fresh data over older data while auto-scaling. Powered by Lucene and OpenSearch, Kaldb allows Slack to prioritize fresh log data and rapidly scale capacity within a Kubernetes-based architecture.

Suman Karumuri

Suman Karumuri is a Principal Software Engineer and the tech lead for Observability at Airbnb. As an expert in distributed tracing, Suman has been a tech lead of Zipkin and a co-author of the OpenTracing standard, a Linux Foundation project under the CNCF. With extensive experience, Suman has spent years building and operating petabyte-scale log search, distributed tracing, and metrics systems at notable companies like Slack, Pinterest, Twitter, and Amazon. In his leisure time, Suman enjoys engaging in board games, exploring the outdoors through hiking, and spending quality time with his children.

BibTeX
@conference {288301,
author = {Suman Karumuri},
title = {Taming Spiky Log Volumes: Maintaining {Real-Time} Log Accessibility with Kaldb},
year = {2023},
address = {Singapore},
publisher = {USENIX Association},
month = jun
}
