ML Infra + Apps

Alpa: Simple large model training and inference on Ray

Ray Summit 2022

Alpa is a Ray-native library for automatically training and serving large models (e.g., GPT-3). It automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to write a parallelization plan by hand or generate one automatically from a limited space of model-parallelism configurations; neither approach is enough to scale out complex DL models on distributed compute devices. Alpa instead organizes parallelism into two hierarchical levels, inter-operator and intra-operator, and uses them to construct a new hierarchical space of model-parallel execution plans. A set of compilation passes automatically derives the optimal execution plan at each level, and an efficient runtime orchestrates the two-level parallel execution across distributed compute devices. Our evaluation shows that Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems, even on the models those systems were designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and to models that have no manually designed plans. This talk covers both the algorithms and the engineering and system implementation, where Ray is a crucial building block of the Alpa runtime.
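To make the two-level idea concrete, here is a toy sketch in plain Python. It is not Alpa's algorithm or API: the function names, the cost model, and the `comm_overhead` parameter are all assumptions made for illustration. An inner function stands in for intra-operator planning by estimating the cost of one stage on a given number of devices, and an outer dynamic program stands in for inter-operator planning by splitting a chain of layers into contiguous pipeline stages and assigning devices to each. The objective is simplified to minimizing the slowest stage.

```python
from functools import lru_cache


def intra_op_cost(layer_costs, num_devices, comm_overhead=1.0):
    """Hypothetical cost of running one stage on `num_devices` devices.

    Compute is split evenly across the devices, and each extra device adds
    a fixed communication penalty. This is a stand-in for the per-stage
    (intra-operator) planning step, not a real cost model.
    """
    return sum(layer_costs) / num_devices + comm_overhead * (num_devices - 1)


def plan_pipeline(layer_costs, total_devices):
    """Toy inter-operator pass.

    Splits the layers into contiguous stages, gives each stage a device
    count, and minimizes the cost of the slowest stage. Not every device
    has to be used.
    """
    n = len(layer_costs)

    @lru_cache(maxsize=None)
    def best(start, devices_left):
        # Returns (bottleneck_cost, plan) for layers[start:], where plan is
        # a tuple of (first_layer, end_layer_exclusive, devices) stages.
        if start == n:
            return (0.0, ())
        if devices_left == 0:
            return (float("inf"), ())
        result = (float("inf"), ())
        for end in range(start + 1, n + 1):
            for d in range(1, devices_left + 1):
                stage = intra_op_cost(layer_costs[start:end], d)
                rest_cost, rest_plan = best(end, devices_left - d)
                cost = max(stage, rest_cost)
                if cost < result[0]:
                    result = (cost, ((start, end, d),) + rest_plan)
        return result

    return best(0, total_devices)


# Four layers with made-up compute costs, planned for four devices.
bottleneck, stages = plan_pipeline([4.0, 4.0, 8.0, 8.0], total_devices=4)
for first, end, devices in stages:
    print(f"layers {first}..{end - 1} on {devices} device(s)")
print(f"slowest stage cost: {bottleneck:.2f}")
```

The point of the nesting is that the outer search never needs to know how a stage is parallelized internally; it only asks the inner step for a cost. A real planner would use a profiled or analytical cost model and a pipeline-latency objective rather than this simplified bottleneck measure.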

About Hao

Hao Zhang is a postdoctoral researcher at UC Berkeley working with Ion Stoica. He completed his PhD at CMU. His research interests lie at the intersection of machine learning and systems, with a focus on improving the performance and ease of use of today's distributed ML systems. Hao's research has been recognized with an NVIDIA Pioneer Research Award at NeurIPS'17 and the Jay Lepreau Best Paper Award at OSDI'21. Hao's open-source artifacts have been used by organizations such as AI2, Meta, and Google, and parts of his research have been commercialized at multiple startups, including Petuum and Anyscale.

Hao Zhang

Postdoc, UC Berkeley
