Zero Overhead Monitoring for Cloud-native Infrastructure using RDMA

Authors: 

Zhe Wang, Shanghai Jiao Tong University; Teng Ma, Alibaba Group; Linghe Kong, Shanghai Jiao Tong University; Zhenzao Wen, Jingxuan Li, Zhuo Song, Yang Lu, Yong Yang, and Tao Ma, Alibaba Group; Guihai Chen, Shanghai Jiao Tong University; Wei Cao, Alibaba Group

Abstract: 

Cloud services have recently undergone a major shift from monolithic designs to microservices running on cloud-native infrastructure, where monitoring systems are widely deployed to ensure the service level agreement (SLA). However, our practical experience in Alibaba Cloud shows that traditional monitoring systems no longer meet the demands of cloud-native monitoring. First, the monitor occupies resources (e.g., CPU) of the monitored infrastructure, disturbing the services running on it. For example, enabling the monitor causes jitter and degradation of online services under the high loads of Alibaba's "Double Eleven" shopping festival. Second, the quality of service (QoS) of monitoring itself, which is vital for tracking and ensuring the SLA, is not guaranteed on a highly loaded system.

In this paper, we design and implement a novel monitoring system, named Zero, for cloud-native monitoring. First, Zero achieves zero overhead when collecting raw metrics from the monitored hosts by using one-sided remote direct memory access (RDMA) operations, thus avoiding any interference with cloud services. Second, Zero adopts a receiver-driven model to collect monitoring metrics with high QoS, in which credit-based flow control and a hybrid I/O model are proposed to mitigate network congestion/interference and CPU bottlenecks. Zero has been deployed and evaluated in Alibaba Cloud. Deployment results show that Zero achieves no CPU occupation at the monitored host and supports 1~10k hosts with a 0.1~1s sampling interval using a single thread for network I/O.
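
The receiver-driven, credit-based idea can be illustrated with a small simulation. This is only a sketch under stated assumptions and is not Zero's implementation: the function `collect_round`, its parameters, and the `read_metric` callback are hypothetical names introduced here, and `read_metric` merely stands in for a one-sided RDMA read that a real collector would post to the network card. The sketch shows how a fixed credit budget on the receiver caps the number of outstanding reads, so the collector never has more requests in flight than it allows.

```python
from collections import deque


def collect_round(hosts, credits, read_metric):
    """Poll every host once, keeping at most `credits` reads in flight.

    `read_metric` is a hypothetical stand-in for a one-sided RDMA read.
    Returns the collected metrics and the peak number of in-flight reads.
    """
    if credits < 1:
        raise ValueError("credits must be at least 1")

    pending = deque(hosts)   # hosts not yet polled in this round
    in_flight = deque()      # hosts with an outstanding read
    results = {}
    peak_in_flight = 0

    while pending or in_flight:
        # The receiver issues new reads only while it still holds credits.
        while pending and len(in_flight) < credits:
            in_flight.append(pending.popleft())
        peak_in_flight = max(peak_in_flight, len(in_flight))

        # One read completes (FIFO order is assumed for simplicity);
        # its credit returns to the receiver and may be reused.
        host = in_flight.popleft()
        results[host] = read_metric(host)

    return results, peak_in_flight
```

For example, calling `collect_round` on ten hosts with `credits=4` polls all ten hosts while never exceeding four outstanding reads. Because the receiver alone decides when a read is issued, the monitored hosts do not need to run any sending logic in this model.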

BibTeX
@inproceedings {280728,
author = {Zhe Wang and Teng Ma and Linghe Kong and Zhenzao Wen and Jingxuan Li and Zhuo Song and Yang Lu and Yong Yang and Tao Ma and Guihai Chen and Wei Cao},
title = {Zero Overhead Monitoring for Cloud-native Infrastructure using {RDMA}},
booktitle = {2022 USENIX Annual Technical Conference (USENIX ATC 22)},
year = {2022},
isbn = {978-1-939133-29-33},
address = {Carlsbad, CA},
pages = {639--654},
url = {https://www.usenix.org/conference/atc22/presentation/wang-zhe},
publisher = {USENIX Association},
month = jul
}
