Optimizing Resilience and Availability by Migrating from JupyterHub to the Kubeflow Notebook Controller

Monday, March 18, 2024 - 4:20 pm–4:40 pm

David Hoover and Alexander Perlman, Capital One

Abstract: 

This presentation details our transition from JupyterHub to the Kubeflow Notebook Controller.

JupyterHub was architected in a backend-agnostic way that "supports" Kubernetes but isn't truly Kubernetes-native. As a result, it has significant shortcomings with respect to resilience and high availability. In particular, the core component, the hub API, can have only one replica at any given time.

In contrast, the Kubeflow Notebook Controller is built from the ground up to be Kubernetes-native using the operator pattern. There's far less complexity, fewer components, less brittleness, and improved resilience and high availability.
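For readers unfamiliar with the operator pattern, the following is a minimal sketch of its core idea, a reconciliation loop that repeatedly drives observed state toward declared state. It is not the Kubeflow Notebook Controller's code: the `Cluster` class, the `reconcile` function, and the notebook names and image tags are all hypothetical, and an in-memory dictionary stands in for the Kubernetes API.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """In-memory stand-in for cluster state (hypothetical, not a real API)."""
    desired: dict = field(default_factory=dict)  # notebook name -> image
    running: dict = field(default_factory=dict)  # notebook name -> image


def reconcile(cluster: Cluster) -> list:
    """Drive running state toward desired state; return the actions taken."""
    actions = []
    # Create or update anything that is missing or out of date.
    for name, image in cluster.desired.items():
        if cluster.running.get(name) != image:
            cluster.running[name] = image
            actions.append(("apply", name))
    # Remove anything that is no longer desired.
    for name in list(cluster.running):
        if name not in cluster.desired:
            del cluster.running[name]
            actions.append(("delete", name))
    return actions


# Hypothetical example data: two desired notebooks, one stale, one orphaned.
cluster = Cluster(
    desired={"alice": "jupyter:1", "bob": "jupyter:1"},
    running={"alice": "jupyter:0", "carol": "jupyter:1"},
)
first = reconcile(cluster)
second = reconcile(cluster)  # a second pass finds nothing left to do
```

Because each pass is idempotent and works only from declared and observed state, a loop of this shape can generally be restarted without losing work, which is one reason the pattern is associated with resilience.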

As a result, our platform has been able to scale to four times as many users, including ten times as many concurrent executions. Our users are happier and there's less operational overhead for platform engineers. Our journey illustrates how properly leveraging Kubernetes-native architecture confers significant benefits.

David Hoover, Capital One

David is a Sr. Lead DevOps Engineer at Capital One. He works on an enterprise-scale Machine Learning Platform to facilitate superior outcomes for data scientists and machine learning engineers. His professional interests include Docker, cybersecurity, Python, and Kubernetes. In his free time, he listens to heavy metal and is an avid cinephile.

Alexander Perlman, Capital One

Alexander Perlman is a senior lead software engineer at Capital One's Machine Learning Experience organization. His areas of focus include distributed compute and workflow orchestration. He lives in the NYC metro area with his wife and three young children. He believes that the correct pronunciation of "kubectl" is "kube-control," not "kube-cuddle." His favorite bubble tea flavor is taro.

BibTeX
@conference {295029,
author = {David Hoover and Alexander Perlman},
title = {Optimizing Resilience and Availability by Migrating from {JupyterHub} to the Kubeflow Notebook Controller},
year = {2024},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = mar
}
