ML Infra + Apps

Large-scale deep learning training and tuning with Ray at Uber

Ray Summit 2022

At Uber, machine learning powers many business decisions to bring about experiences that delight our customers. To empower our product development teams with ML technology, we have built the Michelangelo Machine Learning Platform as an end-to-end solution for standardizing ML application development from exploration to production. In Michelangelo's early years, we centered our technology stack on Apache Spark, a large-scale data processing and classical ML training platform. Recent mass adoption of deep learning by our product teams has driven us to rethink our Spark-based technical architecture. Compared to classical ML techniques such as linear/logistic regression and tree-based methods, deep learning poses significant new infrastructure challenges in compute, network, and storage. In this talk, we will present case studies to highlight some of the specific challenges in large-scale deep learning training, and how we address them by leveraging Ray, a distributed compute platform.

About Xu

Xu Ning, PhD, is a senior engineering manager at Uber leading multiple software development teams in Uber's Michelangelo Machine Learning Platform. His teams are major contributors to Uber's open-source deep learning software: Horovod distributed training framework (horovod.ai), Petastorm data access library (github.com/uber/petastorm), and Neuropod unified inference library (neuropod.ai). Before Michelangelo, he led Uber's Cherami distributed task queue and the Hadoop observability and data security teams. Prior to Uber, he held engineering positions at Facebook, Akamai, and Microsoft. His interests are machine learning, deep learning, MLOps, big data, and infrastructure engineering. His latest work at Uber is published at https://eng.uber.com/author/xu-ning/.

About Michael

Michael Mui is a staff software engineer on Uber's Machine Learning Platform team, working on distributed training infrastructure, hyperparameter optimization, model representation, and evaluation.

About Di

Di Yu is a machine learning system engineer focused on large-scale distributed training systems that enable modern deep learning models to run with high throughput and reliability. Before joining Uber, Di worked as an ML system engineer at Facebook for three years, and before that he worked at ServiceNow on distributed systems.

Xu Ning

Senior Engineering Manager, Uber

Michael Mui

Staff Software Engineer, Uber AI

Di Yu

Sr. Software Engineer, Uber Technologies, Inc.

