Ray Use Cases

Cruise.data: A new dataset processing pipeline for Cruise ML

Ray Summit 2022

At Cruise, we rely on custom data pre-processing before feeding data into ML models. In many cases, ML engineers prefer to develop data pre-processing as part of their ML training code, which makes quick iteration and debugging much easier. This puts high pressure on the performance and reliability of data pre-processing: the next mini-batch must already be available by the time the ML model is ready to accept it, otherwise we waste GPU time. Some of these data transformations could be run offline and cached, but no existing system makes it easy to move the logic between training and offline batch data processing jobs. Moreover, the memory usage of the data processing pipeline is a major consideration for us because of the high-resolution sensors on our cars. In this talk, we will share our progress on building a new ML data pre-processing framework, Cruise.Data, and how Ray helps us scale it. Cruise.Data is a novel system that combines the best properties of tf.data, the PyTorch ecosystem, and large-scale data processing frameworks.

About Alexander

Alexander Sidorov is an engineer with a decade of experience building ML infrastructure. His interests lie in algorithmic work, APIs, and large-scale distributed systems. He is currently an ML infrastructure team lead at Cruise, making sure that Cruise ML infra uses SOTA technology when business needs require it. When that is not enough, he works on pushing the SOTA boundary forward. Previously, at Facebook AI, Alexander led Caffe2 inference efficiency efforts and RNN training development. Before that, he was the team lead of the ML orchestration system FBLearner Flow and the founder and team lead of the company-wide inference service FBLearner Predictor.

About Xuebin

Xuebin (Ann) Yan is a software engineer with 5+ years of experience in machine learning infra. She has worked on both inference and model creation at scale. Her previous model creation pipeline was widely used to build state-of-the-art models efficiently, and those models were then served at scale by the inference infra. She has extensive experience working with machine learning engineers from different AI groups, helping them by building robust, scalable, and user-friendly infra so that they can create new models with ease.

Alexander Sidorov

Sr Staff Software Engineer, Cruise

Xuebin Yan

Staff Software Engineer, Cruise

