Whale: Efficient Giant Model Training over Heterogeneous GPUs

Authors: 

Xianyan Jia, Le Jiang, Ang Wang, and Wencong Xiao, Alibaba Group; Ziji Shi, National University of Singapore & Alibaba Group; Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, Xiaoyong Liu, and Wei Lin, Alibaba Group

Abstract: 

The scaling up of deep neural networks has been demonstrated to be effective in improving model quality, but also encompasses several training challenges in terms of training efficiency, programmability, and resource adaptability. We present Whale, a general and efficient distributed training framework for giant models. To support various parallel strategies and their hybrids, Whale generalizes the programming interface by defining two new primitives in the form of model annotations, allowing for incorporating user hints. The Whale runtime utilizes those annotations and performs graph optimizations to transform a local deep learning DAG graph for distributed multi-GPU execution. Whale further introduces a novel hardware-aware parallel strategy, which improves the performance of model training on heterogeneous GPUs in a balanced manner. Deployed in a production cluster with 512 GPUs, Whale successfully trains an industry-scale multimodal model with over ten trillion model parameters, named M6, demonstrating great scalability and efficiency.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@inproceedings {280776,
author = {Xianyan Jia and Le Jiang and Ang Wang and Wencong Xiao and Ziji Shi and Jie Zhang and Xinyuan Li and Langshi Chen and Yong Li and Zhen Zheng and Xiaoyong Liu and Wei Lin},
title = {Whale: Efficient Giant Model Training over Heterogeneous {GPUs}},
booktitle = {2022 USENIX Annual Technical Conference (USENIX ATC 22)},
year = {2022},
isbn = {978-1-939133-29-57},
address = {Carlsbad, CA},
pages = {673--688},
url = {https://www.usenix.org/conference/atc22/presentation/jia-xianyan},
publisher = {USENIX Association},
month = jul
}

Presentation Video