At Ant Group, we have built various kinds of distributed systems on top of Ray and deployed them in production at large scale. In this talk, we'll cover the problems we've encountered and the improvements we've made to turn Ray into an industrial-grade system with high scalability and stability.
Hao Chen is a staff engineer at Ant Group, where he focuses on distributed computing systems. He leads Ant’s Ray core engineering team, which is responsible for improving Ray’s functionality, performance, reliability, and scalability, as well as supporting Ant’s various Ray applications. He is also a Ray committer and has contributed many features and architectural enhancements to Ray’s open-source community, including the Java API, actor fault tolerance, gRPC, and the GCS service.