Ray Summit 2022
Shuffle is a key primitive in large-scale data processing applications. The difficulty of large-scale shuffle has inspired a myriad of implementations. While these have greatly improved shuffle performance and reliability over time, they have done so at the cost of flexibility. We show that, contrary to popular wisdom, shuffle can be implemented with high performance and reliability on a general-purpose system for distributed computing: Ray. In this talk we present Exoshuffle, an application-level shuffle system that outperforms Spark and achieves 82% of theoretical performance on a 100TB sort on 100 nodes. In Ray 2.0, we have integrated Exoshuffle with the Datasets library to provide high-performance large-scale shuffle for ML users.
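As a rough illustration of that Datasets integration, here is a minimal sketch of how an ML user might trigger a distributed shuffle through the Ray Datasets API in Ray 2.0. The dataset here is a small placeholder rather than a real 100TB workload, and Exoshuffle runs under the hood of these Dataset operations rather than being invoked directly.

```python
import ray

# Connect to an existing Ray cluster, or start one locally.
ray.init()

# Build a Dataset; a real ML workload would typically read files instead,
# e.g. ray.data.read_parquet(...) over a large distributed dataset.
ds = ray.data.range(1_000_000)

# random_shuffle() performs a distributed shuffle of the dataset's blocks;
# per the abstract above, this path is backed by the Exoshuffle integration.
shuffled = ds.random_shuffle()

# sort() is another shuffle-based operation, the same primitive exercised
# by the 100TB sort benchmark mentioned in the talk.
sorted_ds = ds.sort()

print(shuffled.take(5))
```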
Stephanie Wang is a PhD student in distributed systems at UC Berkeley, a software engineer at Anyscale, and a lead committer for the Ray project. Currently, she's working on problems such as fault tolerance and distributed memory management. She is generally interested in the problem of making general-purpose distributed programming possible and in designing fast and reliable distributed systems.
Jiajun Yao is a software engineer at Anyscale and a committer for the Ray project. He is interested in making distributed computing easily accessible to everyone. Before joining Anyscale, Jiajun was a software engineer at LinkedIn, where he worked on its graph database.